2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* CDDL HEADER START
|
|
|
|
*
|
|
|
|
* The contents of this file are subject to the terms of the
|
|
|
|
* Common Development and Distribution License (the "License").
|
|
|
|
* You may not use this file except in compliance with the License.
|
|
|
|
*
|
|
|
|
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
|
|
|
|
* or http://www.opensolaris.org/os/licensing.
|
|
|
|
* See the License for the specific language governing permissions
|
|
|
|
* and limitations under the License.
|
|
|
|
*
|
|
|
|
* When distributing Covered Code, include this CDDL HEADER in each
|
|
|
|
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
|
|
|
|
* If applicable, add the following below this CDDL HEADER, with the
|
|
|
|
* fields enclosed by brackets "[]" replaced with your own identifying
|
|
|
|
* information: Portions Copyright [yyyy] [name of copyright owner]
|
|
|
|
*
|
|
|
|
* CDDL HEADER END
|
|
|
|
*/
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
|
2018-09-06 04:33:36 +03:00
|
|
|
* Copyright (c) 2011, 2018 by Delphix. All rights reserved.
|
2011-11-12 02:07:54 +04:00
|
|
|
* Copyright 2011 Nexenta Systems, Inc. All rights reserved.
|
2013-05-25 06:06:23 +04:00
|
|
|
* Copyright (c) 2013 Steven Hartland. All rights reserved.
|
2017-06-13 12:16:45 +03:00
|
|
|
* Copyright (c) 2014 Integros [integros.com]
|
|
|
|
* Copyright 2017 Joyent, Inc.
|
2018-09-06 04:33:36 +03:00
|
|
|
* Copyright (c) 2017, Intel Corporation.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The objective of this program is to provide a DMU/ZAP/SPA stress test
|
|
|
|
* that runs entirely in userland, is easy to use, and easy to extend.
|
|
|
|
*
|
|
|
|
* The overall design of the ztest program is as follows:
|
|
|
|
*
|
|
|
|
* (1) For each major functional area (e.g. adding vdevs to a pool,
|
|
|
|
* creating and destroying datasets, reading and writing objects, etc)
|
|
|
|
* we have a simple routine to test that functionality. These
|
|
|
|
* individual routines do not have to do anything "stressful".
|
|
|
|
*
|
|
|
|
* (2) We turn these simple functionality tests into a stress test by
|
|
|
|
* running them all in parallel, with as many threads as desired,
|
|
|
|
* and spread across as many datasets, objects, and vdevs as desired.
|
|
|
|
*
|
|
|
|
* (3) While all this is happening, we inject faults into the pool to
|
|
|
|
* verify that self-healing data really works.
|
|
|
|
*
|
|
|
|
* (4) Every time we open a dataset, we change its checksum and compression
|
|
|
|
* functions. Thus even individual objects vary from block to block
|
|
|
|
* in which checksum they use and whether they're compressed.
|
|
|
|
*
|
|
|
|
* (5) To verify that we never lose on-disk consistency after a crash,
|
|
|
|
* we run the entire test in a child of the main process.
|
|
|
|
* At random times, the child self-immolates with a SIGKILL.
|
|
|
|
* This is the software equivalent of pulling the power cord.
|
|
|
|
* The parent then runs the test again, using the existing
|
2014-06-06 01:19:08 +04:00
|
|
|
* storage pool, as many times as desired. If backwards compatibility
|
2012-01-24 06:43:32 +04:00
|
|
|
* testing is enabled ztest will sometimes run the "older" version
|
|
|
|
* of ztest after a SIGKILL.
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
|
|
|
* (6) To verify that we don't have future leaks or temporal incursions,
|
|
|
|
* many of the functional tests record the transaction group number
|
|
|
|
* as part of their data. When reading old data, they verify that
|
|
|
|
* the transaction group number is less than the current, open txg.
|
|
|
|
* If you add a new test, please do this if applicable.
|
|
|
|
*
|
2010-08-26 21:43:27 +04:00
|
|
|
* (7) Threads are created with a reduced stack size, for sanity checking.
|
|
|
|
* Therefore, it's important not to allocate huge buffers on the stack.
|
|
|
|
*
|
2008-11-20 23:01:55 +03:00
|
|
|
* When run with no arguments, ztest runs for about five minutes and
|
|
|
|
* produces no output if successful. To get a little bit of information,
|
|
|
|
* specify -V. To get more information, specify -VV, and so on.
|
|
|
|
*
|
|
|
|
* To turn this into an overnight stress test, use -T to specify run time.
|
|
|
|
*
|
2019-08-30 19:43:30 +03:00
|
|
|
* You can ask more vdevs [-v], datasets [-d], or threads [-t]
|
2008-11-20 23:01:55 +03:00
|
|
|
* to increase the pool capacity, fanout, and overall stress level.
|
|
|
|
*
|
2012-01-24 06:43:32 +04:00
|
|
|
* Use the -k option to set the desired frequency of kills.
|
|
|
|
*
|
|
|
|
* When ztest invokes itself it passes all relevant information through a
|
|
|
|
* temporary file which is mmap-ed in the child process. This allows shared
|
|
|
|
* memory to survive the exec syscall. The ztest_shared_hdr_t struct is always
|
|
|
|
* stored at offset 0 of this file and contains information on the size and
|
|
|
|
* number of shared structures in the file. The information stored in this file
|
|
|
|
* must remain backwards compatible with older versions of ztest so that
|
|
|
|
* ztest can invoke them during backwards compatibility testing (-B).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
|
|
|
#include <sys/zfs_context.h>
|
|
|
|
#include <sys/spa.h>
|
|
|
|
#include <sys/dmu.h>
|
|
|
|
#include <sys/txg.h>
|
2009-07-03 02:44:48 +04:00
|
|
|
#include <sys/dbuf.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/zap.h>
|
|
|
|
#include <sys/dmu_objset.h>
|
|
|
|
#include <sys/poll.h>
|
|
|
|
#include <sys/stat.h>
|
|
|
|
#include <sys/time.h>
|
|
|
|
#include <sys/wait.h>
|
|
|
|
#include <sys/mman.h>
|
|
|
|
#include <sys/resource.h>
|
|
|
|
#include <sys/zio.h>
|
|
|
|
#include <sys/zil.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/zil_impl.h>
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
#include <sys/vdev_draid.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/vdev_impl.h>
|
2008-12-03 23:09:06 +03:00
|
|
|
#include <sys/vdev_file.h>
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
#include <sys/vdev_initialize.h>
|
2019-07-12 19:31:20 +03:00
|
|
|
#include <sys/vdev_raidz.h>
|
2019-03-29 19:13:20 +03:00
|
|
|
#include <sys/vdev_trim.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/spa_impl.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/metaslab_impl.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/dsl_prop.h>
|
2009-07-03 02:44:48 +04:00
|
|
|
#include <sys/dsl_dataset.h>
|
2013-09-04 16:00:57 +04:00
|
|
|
#include <sys/dsl_destroy.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/dsl_scan.h>
|
|
|
|
#include <sys/zio_checksum.h>
|
2020-07-30 02:35:33 +03:00
|
|
|
#include <sys/zfs_refcount.h>
|
2012-12-14 03:24:15 +04:00
|
|
|
#include <sys/zfeature.h>
|
2013-09-04 16:00:57 +04:00
|
|
|
#include <sys/dsl_userhold.h>
|
2016-07-22 18:52:49 +03:00
|
|
|
#include <sys/abd.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <stdio.h>
|
|
|
|
#include <stdlib.h>
|
|
|
|
#include <unistd.h>
|
|
|
|
#include <signal.h>
|
|
|
|
#include <umem.h>
|
|
|
|
#include <ctype.h>
|
|
|
|
#include <math.h>
|
|
|
|
#include <sys/fs/zfs.h>
|
2015-12-10 02:34:16 +03:00
|
|
|
#include <zfs_fletcher.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <libnvpair.h>
|
2018-11-05 22:22:33 +03:00
|
|
|
#include <libzutil.h>
|
2018-08-02 21:59:24 +03:00
|
|
|
#include <sys/crypto/icp.h>
|
2021-02-17 08:51:46 +03:00
|
|
|
#if (__GLIBC__ && !__UCLIBC__)
|
2014-10-11 05:05:54 +04:00
|
|
|
#include <execinfo.h> /* for backtrace() */
|
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-10-04 18:09:12 +04:00
|
|
|
static int ztest_fd_data = -1;
|
|
|
|
static int ztest_fd_rand = -1;
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
typedef struct ztest_shared_hdr {
|
|
|
|
uint64_t zh_hdr_size;
|
|
|
|
uint64_t zh_opts_size;
|
|
|
|
uint64_t zh_size;
|
|
|
|
uint64_t zh_stats_size;
|
|
|
|
uint64_t zh_stats_count;
|
|
|
|
uint64_t zh_ds_size;
|
|
|
|
uint64_t zh_ds_count;
|
|
|
|
} ztest_shared_hdr_t;
|
|
|
|
|
|
|
|
static ztest_shared_hdr_t *ztest_shared_hdr;
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
enum ztest_class_state {
|
|
|
|
ZTEST_VDEV_CLASS_OFF,
|
|
|
|
ZTEST_VDEV_CLASS_ON,
|
|
|
|
ZTEST_VDEV_CLASS_RND
|
|
|
|
};
|
|
|
|
|
2021-02-16 13:14:44 +03:00
|
|
|
#define ZO_GVARS_MAX_ARGLEN ((size_t)64)
|
|
|
|
#define ZO_GVARS_MAX_COUNT ((size_t)10)
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
typedef struct ztest_shared_opts {
|
2016-06-16 00:28:36 +03:00
|
|
|
char zo_pool[ZFS_MAX_DATASET_NAME_LEN];
|
|
|
|
char zo_dir[ZFS_MAX_DATASET_NAME_LEN];
|
2012-01-24 06:43:32 +04:00
|
|
|
char zo_alt_ztest[MAXNAMELEN];
|
|
|
|
char zo_alt_libpath[MAXNAMELEN];
|
|
|
|
uint64_t zo_vdevs;
|
|
|
|
uint64_t zo_vdevtime;
|
|
|
|
size_t zo_vdev_size;
|
|
|
|
int zo_ashift;
|
|
|
|
int zo_mirrors;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
int zo_raid_children;
|
|
|
|
int zo_raid_parity;
|
|
|
|
char zo_raid_type[8];
|
|
|
|
int zo_draid_data;
|
|
|
|
int zo_draid_spares;
|
2012-01-24 06:43:32 +04:00
|
|
|
int zo_datasets;
|
|
|
|
int zo_threads;
|
|
|
|
uint64_t zo_passtime;
|
|
|
|
uint64_t zo_killrate;
|
|
|
|
int zo_verbose;
|
|
|
|
int zo_init;
|
|
|
|
uint64_t zo_time;
|
|
|
|
uint64_t zo_maxloops;
|
2017-07-24 21:07:39 +03:00
|
|
|
uint64_t zo_metaslab_force_ganging;
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
int zo_mmp_test;
|
2018-09-06 04:33:36 +03:00
|
|
|
int zo_special_vdevs;
|
2018-10-15 22:14:22 +03:00
|
|
|
int zo_dump_dbgmsg;
|
2021-02-16 13:14:44 +03:00
|
|
|
int zo_gvars_count;
|
|
|
|
char zo_gvars[ZO_GVARS_MAX_COUNT][ZO_GVARS_MAX_ARGLEN];
|
2012-01-24 06:43:32 +04:00
|
|
|
} ztest_shared_opts_t;
|
|
|
|
|
|
|
|
static const ztest_shared_opts_t ztest_opts_defaults = {
|
2019-01-23 22:17:52 +03:00
|
|
|
.zo_pool = "ztest",
|
|
|
|
.zo_dir = "/tmp",
|
2012-01-24 06:43:32 +04:00
|
|
|
.zo_alt_ztest = { '\0' },
|
|
|
|
.zo_alt_libpath = { '\0' },
|
|
|
|
.zo_vdevs = 5,
|
|
|
|
.zo_ashift = SPA_MINBLOCKSHIFT,
|
|
|
|
.zo_mirrors = 2,
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
.zo_raid_children = 4,
|
|
|
|
.zo_raid_parity = 1,
|
|
|
|
.zo_raid_type = VDEV_TYPE_RAIDZ,
|
2017-01-12 22:52:56 +03:00
|
|
|
.zo_vdev_size = SPA_MINDEVSIZE * 4, /* 256m default size */
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
.zo_draid_data = 4, /* data drives */
|
|
|
|
.zo_draid_spares = 1, /* distributed spares */
|
2012-01-24 06:43:32 +04:00
|
|
|
.zo_datasets = 7,
|
|
|
|
.zo_threads = 23,
|
|
|
|
.zo_passtime = 60, /* 60 seconds */
|
|
|
|
.zo_killrate = 70, /* 70% kill rate */
|
|
|
|
.zo_verbose = 0,
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
.zo_mmp_test = 0,
|
2012-01-24 06:43:32 +04:00
|
|
|
.zo_init = 1,
|
|
|
|
.zo_time = 300, /* 5 minutes */
|
|
|
|
.zo_maxloops = 50, /* max loops during spa_freeze() */
|
2018-11-05 22:53:49 +03:00
|
|
|
.zo_metaslab_force_ganging = 64 << 10,
|
2018-09-06 04:33:36 +03:00
|
|
|
.zo_special_vdevs = ZTEST_VDEV_CLASS_RND,
|
2021-02-16 13:14:44 +03:00
|
|
|
.zo_gvars_count = 0,
|
2012-01-24 06:43:32 +04:00
|
|
|
};
|
|
|
|
|
2017-07-24 21:07:39 +03:00
|
|
|
extern uint64_t metaslab_force_ganging;
|
2012-01-24 06:43:32 +04:00
|
|
|
extern uint64_t metaslab_df_alloc_threshold;
|
2017-12-19 01:06:07 +03:00
|
|
|
extern unsigned long zfs_deadman_synctime_ms;
|
2014-06-13 03:29:11 +04:00
|
|
|
extern int metaslab_preload_limit;
|
2016-06-02 07:04:53 +03:00
|
|
|
extern boolean_t zfs_compressed_arc_enabled;
|
2018-01-19 12:19:47 +03:00
|
|
|
extern int zfs_abd_scatter_enabled;
|
|
|
|
extern int dmu_object_alloc_chunk_shift;
|
2017-08-04 19:30:49 +03:00
|
|
|
extern boolean_t zfs_force_some_double_word_sm_entries;
|
2018-08-29 21:33:33 +03:00
|
|
|
extern unsigned long zio_decompress_fail_fraction;
|
2018-10-01 20:36:34 +03:00
|
|
|
extern unsigned long zfs_reconstruct_indirect_damage_fraction;
|
2019-01-02 22:46:04 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
static ztest_shared_opts_t *ztest_shared_opts;
|
|
|
|
static ztest_shared_opts_t ztest_opts;
|
2017-09-12 23:15:11 +03:00
|
|
|
static char *ztest_wkeydata = "abcdefghijklmnopqrstuvwxyz012345";
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
typedef struct ztest_shared_ds {
|
|
|
|
uint64_t zd_seq;
|
|
|
|
} ztest_shared_ds_t;
|
|
|
|
|
|
|
|
static ztest_shared_ds_t *ztest_shared_ds;
|
|
|
|
#define ZTEST_GET_SHARED_DS(d) (&ztest_shared_ds[d])
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
#define BT_MAGIC 0x123456789abcdefULL
|
2018-02-05 23:00:26 +03:00
|
|
|
#define MAXFAULTS(zs) \
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
(MAX((zs)->zs_mirrors, 1) * (ztest_opts.zo_raid_parity + 1) - 1)
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
enum ztest_io_type {
|
|
|
|
ZTEST_IO_WRITE_TAG,
|
|
|
|
ZTEST_IO_WRITE_PATTERN,
|
|
|
|
ZTEST_IO_WRITE_ZEROES,
|
|
|
|
ZTEST_IO_TRUNCATE,
|
|
|
|
ZTEST_IO_SETATTR,
|
2013-05-10 23:47:54 +04:00
|
|
|
ZTEST_IO_REWRITE,
|
2010-05-29 00:45:14 +04:00
|
|
|
ZTEST_IO_TYPES
|
|
|
|
};
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
typedef struct ztest_block_tag {
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t bt_magic;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t bt_objset;
|
|
|
|
uint64_t bt_object;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t bt_dnodesize;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t bt_offset;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t bt_gen;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t bt_txg;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t bt_crtxg;
|
2008-11-20 23:01:55 +03:00
|
|
|
} ztest_block_tag_t;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
typedef struct bufwad {
|
|
|
|
uint64_t bw_index;
|
|
|
|
uint64_t bw_txg;
|
|
|
|
uint64_t bw_data;
|
|
|
|
} bufwad_t;
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
/*
|
|
|
|
* It would be better to use a rangelock_t per object. Unfortunately
|
|
|
|
* the rangelock_t is not a drop-in replacement for rl_t, because we
|
|
|
|
* still need to map from object ID to rangelock_t.
|
|
|
|
*/
|
|
|
|
typedef enum {
|
|
|
|
RL_READER,
|
|
|
|
RL_WRITER,
|
|
|
|
RL_APPEND
|
|
|
|
} rl_type_t;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
typedef struct rll {
|
|
|
|
void *rll_writer;
|
|
|
|
int rll_readers;
|
2010-08-26 21:43:27 +04:00
|
|
|
kmutex_t rll_lock;
|
|
|
|
kcondvar_t rll_cv;
|
2010-05-29 00:45:14 +04:00
|
|
|
} rll_t;
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
typedef struct rl {
|
|
|
|
uint64_t rl_object;
|
|
|
|
uint64_t rl_offset;
|
|
|
|
uint64_t rl_size;
|
|
|
|
rll_t *rl_lock;
|
|
|
|
} rl_t;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
#define ZTEST_RANGE_LOCKS 64
|
|
|
|
#define ZTEST_OBJECT_LOCKS 64
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Object descriptor. Used as a template for object lookup/create/remove.
|
|
|
|
*/
|
|
|
|
typedef struct ztest_od {
|
|
|
|
uint64_t od_dir;
|
|
|
|
uint64_t od_object;
|
|
|
|
dmu_object_type_t od_type;
|
|
|
|
dmu_object_type_t od_crtype;
|
|
|
|
uint64_t od_blocksize;
|
|
|
|
uint64_t od_crblocksize;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t od_crdnodesize;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t od_gen;
|
|
|
|
uint64_t od_crgen;
|
2016-06-16 00:28:36 +03:00
|
|
|
char od_name[ZFS_MAX_DATASET_NAME_LEN];
|
2010-05-29 00:45:14 +04:00
|
|
|
} ztest_od_t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Per-dataset state.
|
|
|
|
*/
|
|
|
|
typedef struct ztest_ds {
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_ds_t *zd_shared;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *zd_os;
|
2018-06-05 02:52:10 +03:00
|
|
|
pthread_rwlock_t zd_zilog_lock;
|
2010-05-29 00:45:14 +04:00
|
|
|
zilog_t *zd_zilog;
|
|
|
|
ztest_od_t *zd_od; /* debugging aid */
|
2016-06-16 00:28:36 +03:00
|
|
|
char zd_name[ZFS_MAX_DATASET_NAME_LEN];
|
2010-08-26 21:43:27 +04:00
|
|
|
kmutex_t zd_dirobj_lock;
|
2010-05-29 00:45:14 +04:00
|
|
|
rll_t zd_object_lock[ZTEST_OBJECT_LOCKS];
|
2018-10-02 01:13:12 +03:00
|
|
|
rll_t zd_range_lock[ZTEST_RANGE_LOCKS];
|
2010-05-29 00:45:14 +04:00
|
|
|
} ztest_ds_t;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Per-iteration state.
|
|
|
|
*/
|
|
|
|
typedef void ztest_func_t(ztest_ds_t *zd, uint64_t id);
|
|
|
|
|
|
|
|
typedef struct ztest_info {
|
|
|
|
ztest_func_t *zi_func; /* test function */
|
|
|
|
uint64_t zi_iters; /* iterations per execution */
|
|
|
|
uint64_t *zi_interval; /* execute every <interval> seconds */
|
2015-02-24 21:53:31 +03:00
|
|
|
const char *zi_funcname; /* name of test function */
|
2010-05-29 00:45:14 +04:00
|
|
|
} ztest_info_t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
typedef struct ztest_shared_callstate {
|
|
|
|
uint64_t zc_count; /* per-pass count */
|
|
|
|
uint64_t zc_time; /* per-pass time */
|
|
|
|
uint64_t zc_next; /* next time to call this function */
|
|
|
|
} ztest_shared_callstate_t;
|
|
|
|
|
|
|
|
static ztest_shared_callstate_t *ztest_shared_callstate;
|
|
|
|
#define ZTEST_GET_SHARED_CALLSTATE(c) (&ztest_shared_callstate[c])
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_dmu_read_write;
|
|
|
|
ztest_func_t ztest_dmu_write_parallel;
|
|
|
|
ztest_func_t ztest_dmu_object_alloc_free;
|
2018-01-19 12:19:47 +03:00
|
|
|
ztest_func_t ztest_dmu_object_next_chunk;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_dmu_commit_callbacks;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_zap;
|
|
|
|
ztest_func_t ztest_zap_parallel;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_zil_commit;
|
2011-07-26 23:41:53 +04:00
|
|
|
ztest_func_t ztest_zil_remount;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_dmu_read_write_zcopy;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_dmu_objset_create_destroy;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_dmu_prealloc;
|
|
|
|
ztest_func_t ztest_fzap;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_dmu_snapshot_create_destroy;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_dsl_prop_get_set;
|
|
|
|
ztest_func_t ztest_spa_prop_get_set;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_spa_create_destroy;
|
|
|
|
ztest_func_t ztest_fault_inject;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_dmu_snapshot_hold;
|
2017-07-21 03:54:26 +03:00
|
|
|
ztest_func_t ztest_mmp_enable_disable;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_scrub;
|
|
|
|
ztest_func_t ztest_dsl_dataset_promote_busy;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_vdev_attach_detach;
|
|
|
|
ztest_func_t ztest_vdev_LUN_growth;
|
|
|
|
ztest_func_t ztest_vdev_add_remove;
|
2018-09-06 04:33:36 +03:00
|
|
|
ztest_func_t ztest_vdev_class_add;
|
2008-12-03 23:09:06 +03:00
|
|
|
ztest_func_t ztest_vdev_aux_add_remove;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_split_pool;
|
2011-11-12 02:07:54 +04:00
|
|
|
ztest_func_t ztest_reguid;
|
2012-12-15 04:28:49 +04:00
|
|
|
ztest_func_t ztest_spa_upgrade;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
ztest_func_t ztest_device_removal;
|
2016-12-17 01:11:29 +03:00
|
|
|
ztest_func_t ztest_spa_checkpoint_create_discard;
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
ztest_func_t ztest_initialize;
|
2019-03-29 19:13:20 +03:00
|
|
|
ztest_func_t ztest_trim;
|
2015-12-10 02:34:16 +03:00
|
|
|
ztest_func_t ztest_fletcher;
|
2016-09-23 04:52:29 +03:00
|
|
|
ztest_func_t ztest_fletcher_incr;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_func_t ztest_verify_dnode_bt;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t zopt_always = 0ULL * NANOSEC; /* all the time */
|
|
|
|
uint64_t zopt_incessant = 1ULL * NANOSEC / 10; /* every 1/10 second */
|
|
|
|
uint64_t zopt_often = 1ULL * NANOSEC; /* every second */
|
|
|
|
uint64_t zopt_sometimes = 10ULL * NANOSEC; /* every 10 seconds */
|
|
|
|
uint64_t zopt_rarely = 60ULL * NANOSEC; /* every 60 seconds */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-02-24 21:53:31 +03:00
|
|
|
#define ZTI_INIT(func, iters, interval) \
|
|
|
|
{ .zi_func = (func), \
|
|
|
|
.zi_iters = (iters), \
|
|
|
|
.zi_interval = (interval), \
|
|
|
|
.zi_funcname = # func }
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_info_t ztest_info[] = {
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_dmu_read_write, 1, &zopt_always),
|
|
|
|
ZTI_INIT(ztest_dmu_write_parallel, 10, &zopt_always),
|
|
|
|
ZTI_INIT(ztest_dmu_object_alloc_free, 1, &zopt_always),
|
2018-01-19 12:19:47 +03:00
|
|
|
ZTI_INIT(ztest_dmu_object_next_chunk, 1, &zopt_sometimes),
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_dmu_commit_callbacks, 1, &zopt_always),
|
|
|
|
ZTI_INIT(ztest_zap, 30, &zopt_always),
|
|
|
|
ZTI_INIT(ztest_zap_parallel, 100, &zopt_always),
|
|
|
|
ZTI_INIT(ztest_split_pool, 1, &zopt_always),
|
|
|
|
ZTI_INIT(ztest_zil_commit, 1, &zopt_incessant),
|
|
|
|
ZTI_INIT(ztest_zil_remount, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_dmu_read_write_zcopy, 1, &zopt_often),
|
|
|
|
ZTI_INIT(ztest_dmu_objset_create_destroy, 1, &zopt_often),
|
|
|
|
ZTI_INIT(ztest_dsl_prop_get_set, 1, &zopt_often),
|
|
|
|
ZTI_INIT(ztest_spa_prop_get_set, 1, &zopt_sometimes),
|
2010-05-29 00:45:14 +04:00
|
|
|
#if 0
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_dmu_prealloc, 1, &zopt_sometimes),
|
2010-05-29 00:45:14 +04:00
|
|
|
#endif
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_fzap, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_dmu_snapshot_create_destroy, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_spa_create_destroy, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_fault_inject, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_dmu_snapshot_hold, 1, &zopt_sometimes),
|
2017-07-21 03:54:26 +03:00
|
|
|
ZTI_INIT(ztest_mmp_enable_disable, 1, &zopt_sometimes),
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_reguid, 1, &zopt_rarely),
|
|
|
|
ZTI_INIT(ztest_scrub, 1, &zopt_rarely),
|
|
|
|
ZTI_INIT(ztest_spa_upgrade, 1, &zopt_rarely),
|
|
|
|
ZTI_INIT(ztest_dsl_dataset_promote_busy, 1, &zopt_rarely),
|
|
|
|
ZTI_INIT(ztest_vdev_attach_detach, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_vdev_LUN_growth, 1, &zopt_rarely),
|
|
|
|
ZTI_INIT(ztest_vdev_add_remove, 1, &ztest_opts.zo_vdevtime),
|
2018-09-06 04:33:36 +03:00
|
|
|
ZTI_INIT(ztest_vdev_class_add, 1, &ztest_opts.zo_vdevtime),
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_vdev_aux_add_remove, 1, &ztest_opts.zo_vdevtime),
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
ZTI_INIT(ztest_device_removal, 1, &zopt_sometimes),
|
2016-12-17 01:11:29 +03:00
|
|
|
ZTI_INIT(ztest_spa_checkpoint_create_discard, 1, &zopt_rarely),
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
ZTI_INIT(ztest_initialize, 1, &zopt_sometimes),
|
2019-03-29 19:13:20 +03:00
|
|
|
ZTI_INIT(ztest_trim, 1, &zopt_sometimes),
|
2015-12-10 02:34:16 +03:00
|
|
|
ZTI_INIT(ztest_fletcher, 1, &zopt_rarely),
|
2016-09-23 04:52:29 +03:00
|
|
|
ZTI_INIT(ztest_fletcher_incr, 1, &zopt_rarely),
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ZTI_INIT(ztest_verify_dnode_bt, 1, &zopt_sometimes),
|
2008-11-20 23:01:55 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
#define ZTEST_FUNCS (sizeof (ztest_info) / sizeof (ztest_info_t))
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* The following struct is used to hold a list of uncalled commit callbacks.
|
|
|
|
* The callbacks are ordered by txg number.
|
|
|
|
*/
|
|
|
|
typedef struct ztest_cb_list {
|
2010-08-26 21:43:27 +04:00
|
|
|
kmutex_t zcl_callbacks_lock;
|
|
|
|
list_t zcl_callbacks;
|
2010-05-29 00:45:14 +04:00
|
|
|
} ztest_cb_list_t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Stuff we need to share writably between parent and child.
|
|
|
|
*/
|
|
|
|
typedef struct ztest_shared {
|
2012-01-24 06:43:32 +04:00
|
|
|
boolean_t zs_do_init;
|
2010-05-29 00:45:14 +04:00
|
|
|
hrtime_t zs_proc_start;
|
|
|
|
hrtime_t zs_proc_stop;
|
|
|
|
hrtime_t zs_thread_start;
|
|
|
|
hrtime_t zs_thread_stop;
|
|
|
|
hrtime_t zs_thread_kill;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t zs_enospc_count;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t zs_vdev_next_leaf;
|
|
|
|
uint64_t zs_vdev_aux;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t zs_alloc;
|
|
|
|
uint64_t zs_space;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t zs_splits;
|
|
|
|
uint64_t zs_mirrors;
|
2012-01-24 06:43:32 +04:00
|
|
|
uint64_t zs_metaslab_sz;
|
|
|
|
uint64_t zs_metaslab_df_alloc_threshold;
|
|
|
|
uint64_t zs_guid;
|
2008-11-20 23:01:55 +03:00
|
|
|
} ztest_shared_t;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
#define ID_PARALLEL -1ULL
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static char ztest_dev_template[] = "%s/%s.%llua";
|
2008-12-03 23:09:06 +03:00
|
|
|
static char ztest_aux_template[] = "%s/%s.%s.%llu";
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared_t *ztest_shared;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
static spa_t *ztest_spa = NULL;
|
|
|
|
static ztest_ds_t *ztest_ds;
|
|
|
|
|
|
|
|
static kmutex_t ztest_vdev_lock;
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
static boolean_t ztest_device_removal_active = B_FALSE;
|
2019-01-18 20:47:55 +03:00
|
|
|
static boolean_t ztest_pool_scrubbed = B_FALSE;
|
2016-12-17 01:11:29 +03:00
|
|
|
static kmutex_t ztest_checkpoint_lock;
|
2012-12-15 00:38:04 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The ztest_name_lock protects the pool and dataset namespace used by
|
|
|
|
* the individual tests. To modify the namespace, consumers must grab
|
|
|
|
* this lock as writer. Grabbing the lock as reader will ensure that the
|
|
|
|
* namespace does not change while the lock is held.
|
|
|
|
*/
|
2018-06-05 02:52:10 +03:00
|
|
|
static pthread_rwlock_t ztest_name_lock;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
static boolean_t ztest_dump_core = B_TRUE;
|
2008-12-03 23:09:06 +03:00
|
|
|
static boolean_t ztest_exiting;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* Global commit callback list */
|
|
|
|
static ztest_cb_list_t zcl;
|
2010-08-26 21:17:18 +04:00
|
|
|
/* Commit cb delay */
|
|
|
|
static uint64_t zc_min_txg_delay = UINT64_MAX;
|
|
|
|
static int zc_cb_counter = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Minimum number of commit callbacks that need to be registered for us to check
|
|
|
|
* whether the minimum txg delay is acceptable.
|
|
|
|
*/
|
|
|
|
#define ZTEST_COMMIT_CB_MIN_REG 100
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If a number of txgs equal to this threshold have been created after a commit
|
|
|
|
* callback has been registered but not called, then we assume there is an
|
|
|
|
* implementation bug.
|
|
|
|
*/
|
|
|
|
#define ZTEST_COMMIT_CB_THRESH (TXG_CONCURRENT_STATES + 1000)
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
enum ztest_object {
|
|
|
|
ZTEST_META_DNODE = 0,
|
|
|
|
ZTEST_DIROBJ,
|
|
|
|
ZTEST_OBJECTS
|
|
|
|
};
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
static void usage(boolean_t) __NORETURN;
|
2019-01-18 20:47:55 +03:00
|
|
|
static int ztest_scrub_impl(spa_t *spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* These libumem hooks provide a reasonable set of defaults for the allocator's
|
|
|
|
* debugging facilities.
|
|
|
|
*/
|
|
|
|
const char *
|
2010-08-26 20:52:41 +04:00
|
|
|
_umem_debug_init(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
return ("default,verbose"); /* $UMEM_DEBUG setting */
|
|
|
|
}
|
|
|
|
|
|
|
|
const char *
|
|
|
|
_umem_logging_init(void)
|
|
|
|
{
|
|
|
|
return ("fail,contents"); /* $UMEM_LOGGING setting */
|
|
|
|
}
|
|
|
|
|
2017-12-19 01:06:07 +03:00
|
|
|
static void
|
|
|
|
dump_debug_buffer(void)
|
|
|
|
{
|
2018-10-15 22:14:22 +03:00
|
|
|
ssize_t ret __attribute__((unused));
|
|
|
|
|
|
|
|
if (!ztest_opts.zo_dump_dbgmsg)
|
2017-12-19 01:06:07 +03:00
|
|
|
return;
|
|
|
|
|
2018-10-15 22:14:22 +03:00
|
|
|
/*
|
|
|
|
* We use write() instead of printf() so that this function
|
|
|
|
* is safe to call from a signal handler.
|
|
|
|
*/
|
|
|
|
ret = write(STDOUT_FILENO, "\n", 1);
|
2017-12-19 01:06:07 +03:00
|
|
|
zfs_dbgmsg_print("ztest");
|
|
|
|
}
|
|
|
|
|
2014-10-11 05:05:54 +04:00
|
|
|
#define BACKTRACE_SZ 100
|
|
|
|
|
|
|
|
static void sig_handler(int signo)
|
|
|
|
{
|
|
|
|
struct sigaction action;
|
2021-02-17 08:51:46 +03:00
|
|
|
#if (__GLIBC__ && !__UCLIBC__) /* backtrace() is a GNU extension */
|
2014-10-11 05:05:54 +04:00
|
|
|
int nptrs;
|
|
|
|
void *buffer[BACKTRACE_SZ];
|
|
|
|
|
|
|
|
nptrs = backtrace(buffer, BACKTRACE_SZ);
|
|
|
|
backtrace_symbols_fd(buffer, nptrs, STDERR_FILENO);
|
|
|
|
#endif
|
2017-12-19 01:06:07 +03:00
|
|
|
dump_debug_buffer();
|
2014-10-11 05:05:54 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Restore default action and re-raise signal so SIGSEGV and
|
|
|
|
* SIGABRT can trigger a core dump.
|
|
|
|
*/
|
|
|
|
action.sa_handler = SIG_DFL;
|
|
|
|
sigemptyset(&action.sa_mask);
|
|
|
|
action.sa_flags = 0;
|
|
|
|
(void) sigaction(signo, &action, NULL);
|
|
|
|
raise(signo);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#define FATAL_MSG_SZ 1024
|
|
|
|
|
|
|
|
char *fatal_msg;
|
|
|
|
|
|
|
|
static void
|
|
|
|
fatal(int do_perror, char *message, ...)
|
|
|
|
{
|
|
|
|
va_list args;
|
|
|
|
int save_errno = errno;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *buf;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) fflush(stdout);
|
2010-08-26 22:13:05 +04:00
|
|
|
buf = umem_alloc(FATAL_MSG_SZ, UMEM_NOFAIL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
va_start(args, message);
|
|
|
|
(void) sprintf(buf, "ztest: ");
|
|
|
|
/* LINTED */
|
|
|
|
(void) vsprintf(buf + strlen(buf), message, args);
|
|
|
|
va_end(args);
|
|
|
|
if (do_perror) {
|
|
|
|
(void) snprintf(buf + strlen(buf), FATAL_MSG_SZ - strlen(buf),
|
|
|
|
": %s", strerror(save_errno));
|
|
|
|
}
|
|
|
|
(void) fprintf(stderr, "%s\n", buf);
|
|
|
|
fatal_msg = buf; /* to ease debugging */
|
2017-12-19 01:06:07 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (ztest_dump_core)
|
|
|
|
abort();
|
2018-10-15 22:14:22 +03:00
|
|
|
else
|
|
|
|
dump_debug_buffer();
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
exit(3);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
str2shift(const char *buf)
|
|
|
|
{
|
|
|
|
const char *ends = "BKMGTPEZ";
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (buf[0] == '\0')
|
|
|
|
return (0);
|
|
|
|
for (i = 0; i < strlen(ends); i++) {
|
|
|
|
if (toupper(buf[0]) == ends[i])
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (i == strlen(ends)) {
|
|
|
|
(void) fprintf(stderr, "ztest: invalid bytes suffix: %s\n",
|
|
|
|
buf);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
if (buf[1] == '\0' || (toupper(buf[1]) == 'B' && buf[2] == '\0')) {
|
|
|
|
return (10*i);
|
|
|
|
}
|
|
|
|
(void) fprintf(stderr, "ztest: invalid bytes suffix: %s\n", buf);
|
|
|
|
usage(B_FALSE);
|
|
|
|
/* NOTREACHED */
|
|
|
|
}
|
|
|
|
|
|
|
|
static uint64_t
|
|
|
|
nicenumtoull(const char *buf)
|
|
|
|
{
|
|
|
|
char *end;
|
|
|
|
uint64_t val;
|
|
|
|
|
|
|
|
val = strtoull(buf, &end, 0);
|
|
|
|
if (end == buf) {
|
|
|
|
(void) fprintf(stderr, "ztest: bad numeric value: %s\n", buf);
|
|
|
|
usage(B_FALSE);
|
|
|
|
} else if (end[0] == '.') {
|
|
|
|
double fval = strtod(buf, &end);
|
|
|
|
fval *= pow(2, str2shift(end));
|
2020-03-16 21:56:29 +03:00
|
|
|
/*
|
|
|
|
* UINT64_MAX is not exactly representable as a double.
|
|
|
|
* The closest representation is UINT64_MAX + 1, so we
|
|
|
|
* use a >= comparison instead of > for the bounds check.
|
|
|
|
*/
|
|
|
|
if (fval >= (double)UINT64_MAX) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) fprintf(stderr, "ztest: value too large: %s\n",
|
|
|
|
buf);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
val = (uint64_t)fval;
|
|
|
|
} else {
|
|
|
|
int shift = str2shift(end);
|
|
|
|
if (shift >= 64 || (val << shift) >> shift != val) {
|
|
|
|
(void) fprintf(stderr, "ztest: value too large: %s\n",
|
|
|
|
buf);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
val <<= shift;
|
|
|
|
}
|
|
|
|
return (val);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
usage(boolean_t requested)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
const ztest_shared_opts_t *zo = &ztest_opts_defaults;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
char nice_vdev_size[NN_NUMBUF_SZ];
|
2017-07-24 21:07:39 +03:00
|
|
|
char nice_force_ganging[NN_NUMBUF_SZ];
|
2008-11-20 23:01:55 +03:00
|
|
|
FILE *fp = requested ? stdout : stderr;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
nicenum(zo->zo_vdev_size, nice_vdev_size, sizeof (nice_vdev_size));
|
2017-07-24 21:07:39 +03:00
|
|
|
nicenum(zo->zo_metaslab_force_ganging, nice_force_ganging,
|
|
|
|
sizeof (nice_force_ganging));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) fprintf(fp, "Usage: %s\n"
|
|
|
|
"\t[-v vdevs (default: %llu)]\n"
|
|
|
|
"\t[-s size_of_each_vdev (default: %s)]\n"
|
2010-05-29 00:45:14 +04:00
|
|
|
"\t[-a alignment_shift (default: %d)] use 0 for random\n"
|
2008-11-20 23:01:55 +03:00
|
|
|
"\t[-m mirror_copies (default: %d)]\n"
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
"\t[-r raidz_disks / draid_disks (default: %d)]\n"
|
|
|
|
"\t[-R raid_parity (default: %d)]\n"
|
|
|
|
"\t[-K raid_kind (default: random)] raidz|draid|random\n"
|
|
|
|
"\t[-D draid_data (default: %d)] in config\n"
|
|
|
|
"\t[-S draid_spares (default: %d)]\n"
|
2008-11-20 23:01:55 +03:00
|
|
|
"\t[-d datasets (default: %d)]\n"
|
|
|
|
"\t[-t threads (default: %d)]\n"
|
|
|
|
"\t[-g gang_block_threshold (default: %s)]\n"
|
2010-05-29 00:45:14 +04:00
|
|
|
"\t[-i init_count (default: %d)] initialize pool i times\n"
|
|
|
|
"\t[-k kill_percentage (default: %llu%%)]\n"
|
2008-11-20 23:01:55 +03:00
|
|
|
"\t[-p pool_name (default: %s)]\n"
|
2010-05-29 00:45:14 +04:00
|
|
|
"\t[-f dir (default: %s)] file directory for vdev files\n"
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
"\t[-M] Multi-host simulate pool imported on remote host\n"
|
2010-05-29 00:45:14 +04:00
|
|
|
"\t[-V] verbose (use multiple times for ever more blather)\n"
|
|
|
|
"\t[-E] use existing pool instead of creating new one\n"
|
|
|
|
"\t[-T time (default: %llu sec)] total run time\n"
|
|
|
|
"\t[-F freezeloops (default: %llu)] max loops in spa_freeze()\n"
|
|
|
|
"\t[-P passtime (default: %llu sec)] time per pass\n"
|
2012-01-24 06:43:32 +04:00
|
|
|
"\t[-B alt_ztest (default: <none>)] alternate ztest path\n"
|
2018-09-06 04:33:36 +03:00
|
|
|
"\t[-C vdev class state (default: random)] special=on|off|random\n"
|
2017-01-31 21:13:10 +03:00
|
|
|
"\t[-o variable=value] ... set global variable to an unsigned\n"
|
|
|
|
"\t 32-bit integer value\n"
|
2017-12-19 01:06:07 +03:00
|
|
|
"\t[-G dump zfs_dbgmsg buffer before exiting due to an error\n"
|
2008-11-20 23:01:55 +03:00
|
|
|
"\t[-h] (print help)\n"
|
|
|
|
"",
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_pool,
|
|
|
|
(u_longlong_t)zo->zo_vdevs, /* -v */
|
2008-11-20 23:01:55 +03:00
|
|
|
nice_vdev_size, /* -s */
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_ashift, /* -a */
|
|
|
|
zo->zo_mirrors, /* -m */
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
zo->zo_raid_children, /* -r */
|
|
|
|
zo->zo_raid_parity, /* -R */
|
|
|
|
zo->zo_draid_data, /* -D */
|
|
|
|
zo->zo_draid_spares, /* -S */
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_datasets, /* -d */
|
|
|
|
zo->zo_threads, /* -t */
|
2017-07-24 21:07:39 +03:00
|
|
|
nice_force_ganging, /* -g */
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_init, /* -i */
|
|
|
|
(u_longlong_t)zo->zo_killrate, /* -k */
|
|
|
|
zo->zo_pool, /* -p */
|
|
|
|
zo->zo_dir, /* -f */
|
|
|
|
(u_longlong_t)zo->zo_time, /* -T */
|
|
|
|
(u_longlong_t)zo->zo_maxloops, /* -F */
|
|
|
|
(u_longlong_t)zo->zo_passtime);
|
2008-11-20 23:01:55 +03:00
|
|
|
exit(requested ? 0 : 1);
|
|
|
|
}
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
static uint64_t
|
|
|
|
ztest_random(uint64_t range)
|
|
|
|
{
|
|
|
|
uint64_t r;
|
|
|
|
|
|
|
|
ASSERT3S(ztest_fd_rand, >=, 0);
|
|
|
|
|
|
|
|
if (range == 0)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (read(ztest_fd_rand, &r, sizeof (r)) != sizeof (r))
|
|
|
|
fatal(1, "short read from /dev/urandom");
|
|
|
|
|
|
|
|
return (r % range);
|
|
|
|
}
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_parse_name_value(const char *input, ztest_shared_opts_t *zo)
|
|
|
|
{
|
|
|
|
char name[32];
|
|
|
|
char *value;
|
|
|
|
int state = ZTEST_VDEV_CLASS_RND;
|
|
|
|
|
|
|
|
(void) strlcpy(name, input, sizeof (name));
|
|
|
|
|
|
|
|
value = strchr(name, '=');
|
|
|
|
if (value == NULL) {
|
|
|
|
(void) fprintf(stderr, "missing value in property=value "
|
|
|
|
"'-C' argument (%s)\n", input);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
*(value) = '\0';
|
|
|
|
value++;
|
|
|
|
|
|
|
|
if (strcmp(value, "on") == 0) {
|
|
|
|
state = ZTEST_VDEV_CLASS_ON;
|
|
|
|
} else if (strcmp(value, "off") == 0) {
|
|
|
|
state = ZTEST_VDEV_CLASS_OFF;
|
|
|
|
} else if (strcmp(value, "random") == 0) {
|
|
|
|
state = ZTEST_VDEV_CLASS_RND;
|
|
|
|
} else {
|
|
|
|
(void) fprintf(stderr, "invalid property value '%s'\n", value);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (strcmp(name, "special") == 0) {
|
|
|
|
zo->zo_special_vdevs = state;
|
|
|
|
} else {
|
|
|
|
(void) fprintf(stderr, "invalid property name '%s'\n", name);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
if (zo->zo_verbose >= 3)
|
|
|
|
(void) printf("%s vdev state is '%s'\n", name, value);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
process_options(int argc, char **argv)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
char *path;
|
|
|
|
ztest_shared_opts_t *zo = &ztest_opts;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int opt;
|
|
|
|
uint64_t value;
|
2012-01-24 06:43:32 +04:00
|
|
|
char altdir[MAXNAMELEN] = { 0 };
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
char raid_kind[8] = { "random" };
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
bcopy(&ztest_opts_defaults, zo, sizeof (*zo));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
while ((opt = getopt(argc, argv,
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
"v:s:a:m:r:R:K:D:S:d:t:g:i:k:p:f:MVET:P:hF:B:C:o:G")) != EOF) {
|
2008-11-20 23:01:55 +03:00
|
|
|
value = 0;
|
|
|
|
switch (opt) {
|
|
|
|
case 'v':
|
|
|
|
case 's':
|
|
|
|
case 'a':
|
|
|
|
case 'm':
|
|
|
|
case 'r':
|
|
|
|
case 'R':
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
case 'D':
|
|
|
|
case 'S':
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'd':
|
|
|
|
case 't':
|
|
|
|
case 'g':
|
|
|
|
case 'i':
|
|
|
|
case 'k':
|
|
|
|
case 'T':
|
|
|
|
case 'P':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'F':
|
2008-11-20 23:01:55 +03:00
|
|
|
value = nicenumtoull(optarg);
|
|
|
|
}
|
|
|
|
switch (opt) {
|
|
|
|
case 'v':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_vdevs = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 's':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_vdev_size = MAX(SPA_MINDEVSIZE, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'a':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_ashift = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'm':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_mirrors = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'r':
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
zo->zo_raid_children = MAX(1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'R':
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
zo->zo_raid_parity = MIN(MAX(value, 1), 3);
|
|
|
|
break;
|
|
|
|
case 'K':
|
|
|
|
(void) strlcpy(raid_kind, optarg, sizeof (raid_kind));
|
|
|
|
break;
|
|
|
|
case 'D':
|
|
|
|
zo->zo_draid_data = MAX(1, value);
|
|
|
|
break;
|
|
|
|
case 'S':
|
|
|
|
zo->zo_draid_spares = MAX(1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'd':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_datasets = MAX(1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 't':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_threads = MAX(1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'g':
|
2017-07-24 21:07:39 +03:00
|
|
|
zo->zo_metaslab_force_ganging =
|
|
|
|
MAX(SPA_MINBLOCKSIZE << 1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'i':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_init = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'k':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_killrate = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'p':
|
2012-01-24 06:43:32 +04:00
|
|
|
(void) strlcpy(zo->zo_pool, optarg,
|
|
|
|
sizeof (zo->zo_pool));
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'f':
|
2012-01-24 06:43:32 +04:00
|
|
|
path = realpath(optarg, NULL);
|
|
|
|
if (path == NULL) {
|
|
|
|
(void) fprintf(stderr, "error: %s: %s\n",
|
|
|
|
optarg, strerror(errno));
|
|
|
|
usage(B_FALSE);
|
|
|
|
} else {
|
|
|
|
(void) strlcpy(zo->zo_dir, path,
|
|
|
|
sizeof (zo->zo_dir));
|
2016-07-28 23:10:05 +03:00
|
|
|
free(path);
|
2012-01-24 06:43:32 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
case 'M':
|
|
|
|
zo->zo_mmp_test = 1;
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'V':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_verbose++;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'E':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_init = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'T':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_time = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'P':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_passtime = MAX(1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'F':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_maxloops = MAX(1, value);
|
|
|
|
break;
|
|
|
|
case 'B':
|
|
|
|
(void) strlcpy(altdir, optarg, sizeof (altdir));
|
2010-05-29 00:45:14 +04:00
|
|
|
break;
|
2018-09-06 04:33:36 +03:00
|
|
|
case 'C':
|
|
|
|
ztest_parse_name_value(optarg, zo);
|
|
|
|
break;
|
2017-01-31 21:13:10 +03:00
|
|
|
case 'o':
|
2021-02-16 13:14:44 +03:00
|
|
|
if (zo->zo_gvars_count >= ZO_GVARS_MAX_COUNT) {
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"max global var count (%zu) exceeded\n",
|
|
|
|
ZO_GVARS_MAX_COUNT);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
char *v = zo->zo_gvars[zo->zo_gvars_count];
|
|
|
|
if (strlcpy(v, optarg, ZO_GVARS_MAX_ARGLEN) >=
|
|
|
|
ZO_GVARS_MAX_ARGLEN) {
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"global var option '%s' is too long\n",
|
|
|
|
optarg);
|
2017-01-31 21:13:10 +03:00
|
|
|
usage(B_FALSE);
|
2021-02-16 13:14:44 +03:00
|
|
|
}
|
|
|
|
zo->zo_gvars_count++;
|
2017-01-31 21:13:10 +03:00
|
|
|
break;
|
2017-12-19 01:06:07 +03:00
|
|
|
case 'G':
|
2018-10-15 22:14:22 +03:00
|
|
|
zo->zo_dump_dbgmsg = 1;
|
2017-12-19 01:06:07 +03:00
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'h':
|
|
|
|
usage(B_TRUE);
|
|
|
|
break;
|
|
|
|
case '?':
|
|
|
|
default:
|
|
|
|
usage(B_FALSE);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
/* When raid choice is 'random' add a draid pool 50% of the time */
|
|
|
|
if (strcmp(raid_kind, "random") == 0) {
|
|
|
|
(void) strlcpy(raid_kind, (ztest_random(2) == 0) ?
|
|
|
|
"draid" : "raidz", sizeof (raid_kind));
|
|
|
|
|
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
|
|
|
(void) printf("choosing RAID type '%s'\n", raid_kind);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (strcmp(raid_kind, "draid") == 0) {
|
|
|
|
uint64_t min_devsize;
|
|
|
|
|
|
|
|
/* With fewer disk use 256M, otherwise 128M is OK */
|
|
|
|
min_devsize = (ztest_opts.zo_raid_children < 16) ?
|
|
|
|
(256ULL << 20) : (128ULL << 20);
|
|
|
|
|
|
|
|
/* No top-level mirrors with dRAID for now */
|
|
|
|
zo->zo_mirrors = 0;
|
|
|
|
|
|
|
|
/* Use more appropriate defaults for dRAID */
|
|
|
|
if (zo->zo_vdevs == ztest_opts_defaults.zo_vdevs)
|
|
|
|
zo->zo_vdevs = 1;
|
|
|
|
if (zo->zo_raid_children ==
|
|
|
|
ztest_opts_defaults.zo_raid_children)
|
|
|
|
zo->zo_raid_children = 16;
|
|
|
|
if (zo->zo_ashift < 12)
|
|
|
|
zo->zo_ashift = 12;
|
|
|
|
if (zo->zo_vdev_size < min_devsize)
|
|
|
|
zo->zo_vdev_size = min_devsize;
|
|
|
|
|
|
|
|
if (zo->zo_draid_data + zo->zo_raid_parity >
|
|
|
|
zo->zo_raid_children - zo->zo_draid_spares) {
|
|
|
|
(void) fprintf(stderr, "error: too few draid "
|
|
|
|
"children (%d) for stripe width (%d)\n",
|
|
|
|
zo->zo_raid_children,
|
|
|
|
zo->zo_draid_data + zo->zo_raid_parity);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) strlcpy(zo->zo_raid_type, VDEV_TYPE_DRAID,
|
|
|
|
sizeof (zo->zo_raid_type));
|
|
|
|
|
|
|
|
} else /* using raidz */ {
|
|
|
|
ASSERT0(strcmp(raid_kind, "raidz"));
|
|
|
|
|
|
|
|
zo->zo_raid_parity = MIN(zo->zo_raid_parity,
|
|
|
|
zo->zo_raid_children - 1);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_vdevtime =
|
|
|
|
(zo->zo_vdevs > 0 ? zo->zo_time * NANOSEC / zo->zo_vdevs :
|
2010-05-29 00:45:14 +04:00
|
|
|
UINT64_MAX >> 2);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
if (strlen(altdir) > 0) {
|
2012-10-04 23:54:47 +04:00
|
|
|
char *cmd;
|
|
|
|
char *realaltdir;
|
2012-01-24 06:43:32 +04:00
|
|
|
char *bin;
|
|
|
|
char *ztest;
|
|
|
|
char *isa;
|
|
|
|
int isalen;
|
|
|
|
|
2012-10-04 23:54:47 +04:00
|
|
|
cmd = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
realaltdir = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3P(NULL, !=, realpath(getexecname(), cmd));
|
2012-01-24 06:43:32 +04:00
|
|
|
if (0 != access(altdir, F_OK)) {
|
|
|
|
ztest_dump_core = B_FALSE;
|
|
|
|
fatal(B_TRUE, "invalid alternate ztest path: %s",
|
|
|
|
altdir);
|
|
|
|
}
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3P(NULL, !=, realpath(altdir, realaltdir));
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* 'cmd' should be of the form "<anything>/usr/bin/<isa>/ztest".
|
|
|
|
* We want to extract <isa> to determine if we should use
|
|
|
|
* 32 or 64 bit binaries.
|
|
|
|
*/
|
|
|
|
bin = strstr(cmd, "/usr/bin/");
|
|
|
|
ztest = strstr(bin, "/ztest");
|
|
|
|
isa = bin + 9;
|
|
|
|
isalen = ztest - isa;
|
|
|
|
(void) snprintf(zo->zo_alt_ztest, sizeof (zo->zo_alt_ztest),
|
|
|
|
"%s/usr/bin/%.*s/ztest", realaltdir, isalen, isa);
|
|
|
|
(void) snprintf(zo->zo_alt_libpath, sizeof (zo->zo_alt_libpath),
|
|
|
|
"%s/usr/lib/%.*s", realaltdir, isalen, isa);
|
|
|
|
|
|
|
|
if (0 != access(zo->zo_alt_ztest, X_OK)) {
|
|
|
|
ztest_dump_core = B_FALSE;
|
|
|
|
fatal(B_TRUE, "invalid alternate ztest: %s",
|
|
|
|
zo->zo_alt_ztest);
|
|
|
|
} else if (0 != access(zo->zo_alt_libpath, X_OK)) {
|
|
|
|
ztest_dump_core = B_FALSE;
|
|
|
|
fatal(B_TRUE, "invalid alternate lib directory %s",
|
|
|
|
zo->zo_alt_libpath);
|
|
|
|
}
|
2012-10-04 23:54:47 +04:00
|
|
|
|
|
|
|
umem_free(cmd, MAXPATHLEN);
|
|
|
|
umem_free(realaltdir, MAXPATHLEN);
|
2012-01-24 06:43:32 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_kill(ztest_shared_t *zs)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_alloc = metaslab_class_get_alloc(spa_normal_class(ztest_spa));
|
|
|
|
zs->zs_space = metaslab_class_get_space(spa_normal_class(ztest_spa));
|
2013-08-08 00:16:22 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Before we kill off ztest, make sure that the config is updated.
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
* See comment above spa_write_cachefile().
|
2013-08-08 00:16:22 +04:00
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_write_cachefile(ztest_spa, B_FALSE, B_FALSE);
|
2013-08-08 00:16:22 +04:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) kill(getpid(), SIGKILL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
ztest_record_enospc(const char *s)
|
|
|
|
{
|
|
|
|
ztest_shared->zs_enospc_count++;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static uint64_t
|
|
|
|
ztest_get_ashift(void)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_ashift == 0)
|
2014-09-23 03:42:03 +04:00
|
|
|
return (SPA_MINBLOCKSHIFT + ztest_random(5));
|
2012-01-24 06:43:32 +04:00
|
|
|
return (ztest_opts.zo_ashift);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
static boolean_t
|
|
|
|
ztest_is_draid_spare(const char *name)
|
|
|
|
{
|
|
|
|
uint64_t spare_id = 0, parity = 0, vdev_id = 0;
|
|
|
|
|
|
|
|
if (sscanf(name, VDEV_TYPE_DRAID "%llu-%llu-%llu",
|
|
|
|
(u_longlong_t *)&parity, (u_longlong_t *)&vdev_id,
|
|
|
|
(u_longlong_t *)&spare_id) == 3) {
|
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static nvlist_t *
|
2012-12-15 04:28:49 +04:00
|
|
|
make_vdev_file(char *path, char *aux, char *pool, size_t size, uint64_t ashift)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
char *pathbuf;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t vdev;
|
|
|
|
nvlist_t *file;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
boolean_t draid_spare = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
pathbuf = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (ashift == 0)
|
|
|
|
ashift = ztest_get_ashift();
|
|
|
|
|
|
|
|
if (path == NULL) {
|
|
|
|
path = pathbuf;
|
|
|
|
|
|
|
|
if (aux != NULL) {
|
|
|
|
vdev = ztest_shared->zs_vdev_aux;
|
2012-01-24 06:43:32 +04:00
|
|
|
(void) snprintf(path, MAXPATHLEN,
|
|
|
|
ztest_aux_template, ztest_opts.zo_dir,
|
2012-12-15 04:28:49 +04:00
|
|
|
pool == NULL ? ztest_opts.zo_pool : pool,
|
|
|
|
aux, vdev);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev = ztest_shared->zs_vdev_next_leaf++;
|
2012-01-24 06:43:32 +04:00
|
|
|
(void) snprintf(path, MAXPATHLEN,
|
|
|
|
ztest_dev_template, ztest_opts.zo_dir,
|
2012-12-15 04:28:49 +04:00
|
|
|
pool == NULL ? ztest_opts.zo_pool : pool, vdev);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
} else {
|
|
|
|
draid_spare = ztest_is_draid_spare(path);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (size != 0 && !draid_spare) {
|
2008-12-03 23:09:06 +03:00
|
|
|
int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (fd == -1)
|
2008-12-03 23:09:06 +03:00
|
|
|
fatal(1, "can't open %s", path);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (ftruncate(fd, size) != 0)
|
2008-12-03 23:09:06 +03:00
|
|
|
fatal(1, "can't ftruncate %s", path);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) close(fd);
|
|
|
|
}
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
file = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(file, ZPOOL_CONFIG_TYPE,
|
|
|
|
draid_spare ? VDEV_TYPE_DRAID_SPARE : VDEV_TYPE_FILE);
|
|
|
|
fnvlist_add_string(file, ZPOOL_CONFIG_PATH, path);
|
|
|
|
fnvlist_add_uint64(file, ZPOOL_CONFIG_ASHIFT, ashift);
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(pathbuf, MAXPATHLEN);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (file);
|
|
|
|
}
|
|
|
|
|
|
|
|
static nvlist_t *
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
make_vdev_raid(char *path, char *aux, char *pool, size_t size,
|
2012-12-15 04:28:49 +04:00
|
|
|
uint64_t ashift, int r)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
nvlist_t *raid, **child;
|
2008-11-20 23:01:55 +03:00
|
|
|
int c;
|
|
|
|
|
|
|
|
if (r < 2)
|
2012-12-15 04:28:49 +04:00
|
|
|
return (make_vdev_file(path, aux, pool, size, ashift));
|
2008-11-20 23:01:55 +03:00
|
|
|
child = umem_alloc(r * sizeof (nvlist_t *), UMEM_NOFAIL);
|
|
|
|
|
|
|
|
for (c = 0; c < r; c++)
|
2012-12-15 04:28:49 +04:00
|
|
|
child[c] = make_vdev_file(path, aux, pool, size, ashift);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
raid = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(raid, ZPOOL_CONFIG_TYPE,
|
|
|
|
ztest_opts.zo_raid_type);
|
|
|
|
fnvlist_add_uint64(raid, ZPOOL_CONFIG_NPARITY,
|
|
|
|
ztest_opts.zo_raid_parity);
|
|
|
|
fnvlist_add_nvlist_array(raid, ZPOOL_CONFIG_CHILDREN, child, r);
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
|
|
|
|
if (strcmp(ztest_opts.zo_raid_type, VDEV_TYPE_DRAID) == 0) {
|
|
|
|
uint64_t ndata = ztest_opts.zo_draid_data;
|
|
|
|
uint64_t nparity = ztest_opts.zo_raid_parity;
|
|
|
|
uint64_t nspares = ztest_opts.zo_draid_spares;
|
|
|
|
uint64_t children = ztest_opts.zo_raid_children;
|
|
|
|
uint64_t ngroups = 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate the minimum number of groups required to fill a
|
|
|
|
* slice. This is the LCM of the stripe width (data + parity)
|
|
|
|
* and the number of data drives (children - spares).
|
|
|
|
*/
|
|
|
|
while (ngroups * (ndata + nparity) % (children - nspares) != 0)
|
|
|
|
ngroups++;
|
|
|
|
|
|
|
|
/* Store the basic dRAID configuration. */
|
|
|
|
fnvlist_add_uint64(raid, ZPOOL_CONFIG_DRAID_NDATA, ndata);
|
|
|
|
fnvlist_add_uint64(raid, ZPOOL_CONFIG_DRAID_NSPARES, nspares);
|
|
|
|
fnvlist_add_uint64(raid, ZPOOL_CONFIG_DRAID_NGROUPS, ngroups);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (c = 0; c < r; c++)
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(child[c]);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
umem_free(child, r * sizeof (nvlist_t *));
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
return (raid);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static nvlist_t *
|
2012-12-15 04:28:49 +04:00
|
|
|
make_vdev_mirror(char *path, char *aux, char *pool, size_t size,
|
|
|
|
uint64_t ashift, int r, int m)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
nvlist_t *mirror, **child;
|
|
|
|
int c;
|
|
|
|
|
|
|
|
if (m < 1)
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
return (make_vdev_raid(path, aux, pool, size, ashift, r));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
child = umem_alloc(m * sizeof (nvlist_t *), UMEM_NOFAIL);
|
|
|
|
|
|
|
|
for (c = 0; c < m; c++)
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
child[c] = make_vdev_raid(path, aux, pool, size, ashift, r);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
mirror = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(mirror, ZPOOL_CONFIG_TYPE, VDEV_TYPE_MIRROR);
|
|
|
|
fnvlist_add_nvlist_array(mirror, ZPOOL_CONFIG_CHILDREN, child, m);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (c = 0; c < m; c++)
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(child[c]);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
umem_free(child, m * sizeof (nvlist_t *));
|
|
|
|
|
|
|
|
return (mirror);
|
|
|
|
}
|
|
|
|
|
|
|
|
static nvlist_t *
|
2012-12-15 04:28:49 +04:00
|
|
|
make_vdev_root(char *path, char *aux, char *pool, size_t size, uint64_t ashift,
|
2018-09-06 04:33:36 +03:00
|
|
|
const char *class, int r, int m, int t)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
nvlist_t *root, **child;
|
|
|
|
int c;
|
2018-09-06 04:33:36 +03:00
|
|
|
boolean_t log;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(t, >, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
log = (class != NULL && strcmp(class, "log") == 0);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
child = umem_alloc(t * sizeof (nvlist_t *), UMEM_NOFAIL);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
for (c = 0; c < t; c++) {
|
2012-12-15 04:28:49 +04:00
|
|
|
child[c] = make_vdev_mirror(path, aux, pool, size, ashift,
|
|
|
|
r, m);
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_add_uint64(child[c], ZPOOL_CONFIG_IS_LOG, log);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
if (class != NULL && class[0] != '\0') {
|
|
|
|
ASSERT(m > 1 || log); /* expecting a mirror */
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_add_string(child[c],
|
|
|
|
ZPOOL_CONFIG_ALLOCATION_BIAS, class);
|
2018-09-06 04:33:36 +03:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
root = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(root, ZPOOL_CONFIG_TYPE, VDEV_TYPE_ROOT);
|
|
|
|
fnvlist_add_nvlist_array(root, aux ? aux : ZPOOL_CONFIG_CHILDREN,
|
|
|
|
child, t);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (c = 0; c < t; c++)
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(child[c]);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
umem_free(child, t * sizeof (nvlist_t *));
|
|
|
|
|
|
|
|
return (root);
|
|
|
|
}
|
|
|
|
|
2012-12-15 04:28:49 +04:00
|
|
|
/*
|
|
|
|
* Find a random spa version. Returns back a random spa version in the
|
|
|
|
* range [initial_version, SPA_VERSION_FEATURES].
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
ztest_random_spa_version(uint64_t initial_version)
|
|
|
|
{
|
|
|
|
uint64_t version = initial_version;
|
|
|
|
|
|
|
|
if (version <= SPA_VERSION_BEFORE_FEATURES) {
|
|
|
|
version = version +
|
|
|
|
ztest_random(SPA_VERSION_BEFORE_FEATURES - version + 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (version > SPA_VERSION_BEFORE_FEATURES)
|
|
|
|
version = SPA_VERSION_FEATURES;
|
|
|
|
|
|
|
|
ASSERT(SPA_VERSION_IS_SUPPORTED(version));
|
|
|
|
return (version);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
|
|
|
ztest_random_blocksize(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(ztest_spa->spa_max_ashift, !=, 0);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
2014-11-03 23:15:08 +03:00
|
|
|
/*
|
|
|
|
* Choose a block size >= the ashift.
|
|
|
|
* If the SPA supports new MAXBLOCKSIZE, test up to 1MB blocks.
|
|
|
|
*/
|
|
|
|
int maxbs = SPA_OLD_MAXBLOCKSHIFT;
|
|
|
|
if (spa_maxblocksize(ztest_spa) == SPA_MAXBLOCKSIZE)
|
|
|
|
maxbs = 20;
|
2015-05-20 07:14:01 +03:00
|
|
|
uint64_t block_shift =
|
|
|
|
ztest_random(maxbs - ztest_spa->spa_max_ashift + 1);
|
2014-09-23 03:42:03 +04:00
|
|
|
return (1 << (SPA_MINBLOCKSHIFT + block_shift));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
static int
|
|
|
|
ztest_random_dnodesize(void)
|
|
|
|
{
|
|
|
|
int slots;
|
|
|
|
int max_slots = spa_maxdnodesize(ztest_spa) >> DNODE_SHIFT;
|
|
|
|
|
|
|
|
if (max_slots == DNODE_MIN_SLOTS)
|
|
|
|
return (DNODE_MIN_SIZE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Weight the random distribution more heavily toward smaller
|
|
|
|
* dnode sizes since that is more likely to reflect real-world
|
|
|
|
* usage.
|
|
|
|
*/
|
|
|
|
ASSERT3U(max_slots, >, 4);
|
|
|
|
switch (ztest_random(10)) {
|
|
|
|
case 0:
|
|
|
|
slots = 5 + ztest_random(max_slots - 4);
|
|
|
|
break;
|
|
|
|
case 1 ... 4:
|
|
|
|
slots = 2 + ztest_random(3);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
slots = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (slots << DNODE_SHIFT);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
|
|
|
ztest_random_ibshift(void)
|
|
|
|
{
|
|
|
|
return (DN_MIN_INDBLKSHIFT +
|
|
|
|
ztest_random(DN_MAX_INDBLKSHIFT - DN_MIN_INDBLKSHIFT + 1));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static uint64_t
|
|
|
|
ztest_random_vdev_top(spa_t *spa, boolean_t log_ok)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t top;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
vdev_t *tvd;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
do {
|
|
|
|
top = ztest_random(rvd->vdev_children);
|
|
|
|
tvd = rvd->vdev_child[top];
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
} while (!vdev_is_concrete(tvd) || (tvd->vdev_islog && !log_ok) ||
|
2010-05-29 00:45:14 +04:00
|
|
|
tvd->vdev_mg == NULL || tvd->vdev_mg->mg_class == NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return (top);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static uint64_t
|
|
|
|
ztest_random_dsl_prop(zfs_prop_t prop)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t value;
|
|
|
|
|
|
|
|
do {
|
|
|
|
value = zfs_prop_random_value(prop, ztest_random(-1ULL));
|
|
|
|
} while (prop == ZFS_PROP_CHECKSUM && value == ZIO_CHECKSUM_OFF);
|
|
|
|
|
|
|
|
return (value);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dsl_prop_set_uint64(char *osname, zfs_prop_t prop, uint64_t value,
|
|
|
|
boolean_t inherit)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
const char *propname = zfs_prop_to_name(prop);
|
|
|
|
const char *valname;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *setpoint;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t curval;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_prop_set_int(osname, propname,
|
|
|
|
(inherit ? ZPROP_SRC_NONE : ZPROP_SRC_LOCAL), value);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (error);
|
|
|
|
}
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
setpoint = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY0(dsl_prop_get_integer(osname, propname, &curval, setpoint));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6) {
|
2015-07-30 07:11:32 +03:00
|
|
|
int err;
|
|
|
|
|
|
|
|
err = zfs_prop_index_to_string(prop, curval, &valname);
|
|
|
|
if (err)
|
2016-12-12 21:46:26 +03:00
|
|
|
(void) printf("%s %s = %llu at '%s'\n", osname,
|
|
|
|
propname, (unsigned long long)curval, setpoint);
|
2015-07-30 07:11:32 +03:00
|
|
|
else
|
|
|
|
(void) printf("%s %s = %s at '%s'\n",
|
|
|
|
osname, propname, valname, setpoint);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(setpoint, MAXPATHLEN);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_spa_prop_set_uint64(zpool_prop_t prop, uint64_t value)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *props = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
props = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64(props, zpool_prop_to_name(prop), value);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_prop_set(spa, props);
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(props);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (error);
|
|
|
|
}
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
static int
|
|
|
|
ztest_dmu_objset_own(const char *name, dmu_objset_type_t type,
|
|
|
|
boolean_t readonly, boolean_t decrypt, void *tag, objset_t **osp)
|
|
|
|
{
|
|
|
|
int err;
|
2018-11-08 02:40:24 +03:00
|
|
|
char *cp = NULL;
|
|
|
|
char ddname[ZFS_MAX_DATASET_NAME_LEN];
|
|
|
|
|
|
|
|
strcpy(ddname, name);
|
|
|
|
cp = strchr(ddname, '@');
|
|
|
|
if (cp != NULL)
|
|
|
|
*cp = '\0';
|
2017-09-12 23:15:11 +03:00
|
|
|
|
|
|
|
err = dmu_objset_own(name, type, readonly, decrypt, tag, osp);
|
2018-11-08 02:40:24 +03:00
|
|
|
while (decrypt && err == EACCES) {
|
2017-09-12 23:15:11 +03:00
|
|
|
dsl_crypto_params_t *dcp;
|
|
|
|
nvlist_t *crypto_args = fnvlist_alloc();
|
|
|
|
|
|
|
|
fnvlist_add_uint8_array(crypto_args, "wkeydata",
|
|
|
|
(uint8_t *)ztest_wkeydata, WRAPPING_KEY_LEN);
|
|
|
|
VERIFY0(dsl_crypto_params_create_nvlist(DCP_CMD_NONE, NULL,
|
|
|
|
crypto_args, &dcp));
|
|
|
|
err = spa_keystore_load_wkey(ddname, dcp, B_FALSE);
|
2020-12-25 07:58:17 +03:00
|
|
|
/*
|
|
|
|
* Note: if there was an error loading, the wkey was not
|
|
|
|
* consumed, and needs to be freed.
|
|
|
|
*/
|
|
|
|
dsl_crypto_params_free(dcp, (err != 0));
|
2017-09-12 23:15:11 +03:00
|
|
|
fnvlist_free(crypto_args);
|
|
|
|
|
2018-11-08 02:40:24 +03:00
|
|
|
if (err == EINVAL) {
|
|
|
|
/*
|
|
|
|
* We couldn't load a key for this dataset so try
|
|
|
|
* the parent. This loop will eventually hit the
|
|
|
|
* encryption root since ztest only makes clones
|
|
|
|
* as children of their origin datasets.
|
|
|
|
*/
|
|
|
|
cp = strrchr(ddname, '/');
|
|
|
|
if (cp == NULL)
|
|
|
|
return (err);
|
|
|
|
|
|
|
|
*cp = '\0';
|
|
|
|
err = EACCES;
|
|
|
|
continue;
|
|
|
|
} else if (err != 0) {
|
|
|
|
break;
|
|
|
|
}
|
2017-09-12 23:15:11 +03:00
|
|
|
|
|
|
|
err = dmu_objset_own(name, type, readonly, decrypt, tag, osp);
|
2018-11-08 02:40:24 +03:00
|
|
|
break;
|
2017-09-12 23:15:11 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_rll_init(rll_t *rll)
|
|
|
|
{
|
|
|
|
rll->rll_writer = NULL;
|
|
|
|
rll->rll_readers = 0;
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_init(&rll->rll_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&rll->rll_cv, NULL, CV_DEFAULT, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_rll_destroy(rll_t *rll)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(rll->rll_writer, ==, NULL);
|
|
|
|
ASSERT0(rll->rll_readers);
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_destroy(&rll->rll_lock);
|
|
|
|
cv_destroy(&rll->rll_cv);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_rll_lock(rll_t *rll, rl_type_t type)
|
|
|
|
{
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_enter(&rll->rll_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (type == RL_READER) {
|
|
|
|
while (rll->rll_writer != NULL)
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) cv_wait(&rll->rll_cv, &rll->rll_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
rll->rll_readers++;
|
|
|
|
} else {
|
|
|
|
while (rll->rll_writer != NULL || rll->rll_readers)
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) cv_wait(&rll->rll_cv, &rll->rll_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
rll->rll_writer = curthread;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_exit(&rll->rll_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_rll_unlock(rll_t *rll)
|
|
|
|
{
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_enter(&rll->rll_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (rll->rll_writer) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(rll->rll_readers);
|
2010-05-29 00:45:14 +04:00
|
|
|
rll->rll_writer = NULL;
|
|
|
|
} else {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(rll->rll_readers, >, 0);
|
|
|
|
ASSERT3P(rll->rll_writer, ==, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
rll->rll_readers--;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (rll->rll_writer == NULL && rll->rll_readers == 0)
|
2010-08-26 21:43:27 +04:00
|
|
|
cv_broadcast(&rll->rll_cv);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_exit(&rll->rll_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_object_lock(ztest_ds_t *zd, uint64_t object, rl_type_t type)
|
2008-12-03 23:09:06 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
rll_t *rll = &zd->zd_object_lock[object & (ZTEST_OBJECT_LOCKS - 1)];
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_rll_lock(rll, type);
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_object_unlock(ztest_ds_t *zd, uint64_t object)
|
|
|
|
{
|
|
|
|
rll_t *rll = &zd->zd_object_lock[object & (ZTEST_OBJECT_LOCKS - 1)];
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_rll_unlock(rll);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
static rl_t *
|
2016-01-05 23:47:58 +03:00
|
|
|
ztest_range_lock(ztest_ds_t *zd, uint64_t object, uint64_t offset,
|
|
|
|
uint64_t size, rl_type_t type)
|
|
|
|
{
|
2018-10-02 01:13:12 +03:00
|
|
|
uint64_t hash = object ^ (offset % (ZTEST_RANGE_LOCKS + 1));
|
|
|
|
rll_t *rll = &zd->zd_range_lock[hash & (ZTEST_RANGE_LOCKS - 1)];
|
|
|
|
rl_t *rl;
|
|
|
|
|
|
|
|
rl = umem_alloc(sizeof (*rl), UMEM_NOFAIL);
|
|
|
|
rl->rl_object = object;
|
|
|
|
rl->rl_offset = offset;
|
|
|
|
rl->rl_size = size;
|
|
|
|
rl->rl_lock = rll;
|
|
|
|
|
|
|
|
ztest_rll_lock(rll, type);
|
|
|
|
|
|
|
|
return (rl);
|
2016-01-05 23:47:58 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-01-05 23:47:58 +03:00
|
|
|
static void
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl_t *rl)
|
2016-01-05 23:47:58 +03:00
|
|
|
{
|
2018-10-02 01:13:12 +03:00
|
|
|
rll_t *rll = rl->rl_lock;
|
|
|
|
|
|
|
|
ztest_rll_unlock(rll);
|
|
|
|
|
|
|
|
umem_free(rl, sizeof (*rl));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_zd_init(ztest_ds_t *zd, ztest_shared_ds_t *szd, objset_t *os)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
zd->zd_os = os;
|
|
|
|
zd->zd_zilog = dmu_objset_zil(os);
|
2012-01-24 06:43:32 +04:00
|
|
|
zd->zd_shared = szd;
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_name(os, zd->zd_name);
|
2010-08-26 20:52:39 +04:00
|
|
|
int l;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (zd->zd_shared != NULL)
|
|
|
|
zd->zd_shared->zd_seq = 0;
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
VERIFY0(pthread_rwlock_init(&zd->zd_zilog_lock, NULL));
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_init(&zd->zd_dirobj_lock, NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (l = 0; l < ZTEST_OBJECT_LOCKS; l++)
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_rll_init(&zd->zd_object_lock[l]);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (l = 0; l < ZTEST_RANGE_LOCKS; l++)
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_rll_init(&zd->zd_range_lock[l]);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_zd_fini(ztest_ds_t *zd)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-08-26 20:52:39 +04:00
|
|
|
int l;
|
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_destroy(&zd->zd_dirobj_lock);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_destroy(&zd->zd_zilog_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (l = 0; l < ZTEST_OBJECT_LOCKS; l++)
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_rll_destroy(&zd->zd_object_lock[l]);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (l = 0; l < ZTEST_RANGE_LOCKS; l++)
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_rll_destroy(&zd->zd_range_lock[l]);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
#define TXG_MIGHTWAIT (ztest_random(10) == 0 ? TXG_NOWAIT : TXG_WAIT)
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static uint64_t
|
|
|
|
ztest_tx_assign(dmu_tx_t *tx, uint64_t txg_how, const char *tag)
|
|
|
|
{
|
|
|
|
uint64_t txg;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Attempt to assign tx to some transaction group.
|
|
|
|
*/
|
|
|
|
error = dmu_tx_assign(tx, txg_how);
|
|
|
|
if (error) {
|
|
|
|
if (error == ERESTART) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(txg_how, ==, TXG_NOWAIT);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_wait(tx);
|
|
|
|
} else {
|
|
|
|
ASSERT3U(error, ==, ENOSPC);
|
|
|
|
ztest_record_enospc(tag);
|
|
|
|
}
|
|
|
|
dmu_tx_abort(tx);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
txg = dmu_tx_get_txg(tx);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(txg, !=, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (txg);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_bt_generate(ztest_block_tag_t *bt, objset_t *os, uint64_t object,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t dnodesize, uint64_t offset, uint64_t gen, uint64_t txg,
|
|
|
|
uint64_t crtxg)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
bt->bt_magic = BT_MAGIC;
|
|
|
|
bt->bt_objset = dmu_objset_id(os);
|
|
|
|
bt->bt_object = object;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bt->bt_dnodesize = dnodesize;
|
2010-05-29 00:45:14 +04:00
|
|
|
bt->bt_offset = offset;
|
|
|
|
bt->bt_gen = gen;
|
|
|
|
bt->bt_txg = txg;
|
|
|
|
bt->bt_crtxg = crtxg;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_bt_verify(ztest_block_tag_t *bt, objset_t *os, uint64_t object,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t dnodesize, uint64_t offset, uint64_t gen, uint64_t txg,
|
|
|
|
uint64_t crtxg)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT3U(bt->bt_magic, ==, BT_MAGIC);
|
|
|
|
ASSERT3U(bt->bt_objset, ==, dmu_objset_id(os));
|
|
|
|
ASSERT3U(bt->bt_object, ==, object);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ASSERT3U(bt->bt_dnodesize, ==, dnodesize);
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT3U(bt->bt_offset, ==, offset);
|
|
|
|
ASSERT3U(bt->bt_gen, <=, gen);
|
|
|
|
ASSERT3U(bt->bt_txg, <=, txg);
|
|
|
|
ASSERT3U(bt->bt_crtxg, ==, crtxg);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static ztest_block_tag_t *
|
|
|
|
ztest_bt_bonus(dmu_buf_t *db)
|
|
|
|
{
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
ztest_block_tag_t *bt;
|
|
|
|
|
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
ASSERT3U(doi.doi_bonus_size, <=, db->db_size);
|
|
|
|
ASSERT3U(doi.doi_bonus_size, >=, sizeof (*bt));
|
|
|
|
bt = (void *)((char *)db->db_data + doi.doi_bonus_size - sizeof (*bt));
|
|
|
|
|
|
|
|
return (bt);
|
|
|
|
}
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
/*
|
|
|
|
* Generate a token to fill up unused bonus buffer space. Try to make
|
|
|
|
* it unique to the object, generation, and offset to verify that data
|
|
|
|
* is not getting overwritten by data from other dnodes.
|
|
|
|
*/
|
|
|
|
#define ZTEST_BONUS_FILL_TOKEN(obj, ds, gen, offset) \
|
|
|
|
(((ds) << 48) | ((gen) << 32) | ((obj) << 8) | (offset))
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Fill up the unused bonus buffer region before the block tag with a
|
|
|
|
* verifiable pattern. Filling the whole bonus area with non-zero data
|
|
|
|
* helps ensure that all dnode traversal code properly skips the
|
|
|
|
* interior regions of large dnodes.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_fill_unused_bonus(dmu_buf_t *db, void *end, uint64_t obj,
|
|
|
|
objset_t *os, uint64_t gen)
|
|
|
|
{
|
|
|
|
uint64_t *bonusp;
|
|
|
|
|
|
|
|
ASSERT(IS_P2ALIGNED((char *)end - (char *)db->db_data, 8));
|
|
|
|
|
|
|
|
for (bonusp = db->db_data; bonusp < (uint64_t *)end; bonusp++) {
|
|
|
|
uint64_t token = ZTEST_BONUS_FILL_TOKEN(obj, dmu_objset_id(os),
|
|
|
|
gen, bonusp - (uint64_t *)db->db_data);
|
|
|
|
*bonusp = token;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that the unused area of a bonus buffer is filled with the
|
|
|
|
* expected tokens.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_verify_unused_bonus(dmu_buf_t *db, void *end, uint64_t obj,
|
|
|
|
objset_t *os, uint64_t gen)
|
|
|
|
{
|
|
|
|
uint64_t *bonusp;
|
|
|
|
|
|
|
|
for (bonusp = db->db_data; bonusp < (uint64_t *)end; bonusp++) {
|
|
|
|
uint64_t token = ZTEST_BONUS_FILL_TOKEN(obj, dmu_objset_id(os),
|
|
|
|
gen, bonusp - (uint64_t *)db->db_data);
|
|
|
|
VERIFY3U(*bonusp, ==, token);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* ZIL logging ops
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define lrz_type lr_mode
|
|
|
|
#define lrz_blocksize lr_uid
|
|
|
|
#define lrz_ibshift lr_gid
|
|
|
|
#define lrz_bonustype lr_rdev
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
#define lrz_dnodesize lr_crtime[1]
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_log_create(ztest_ds_t *zd, dmu_tx_t *tx, lr_create_t *lr)
|
|
|
|
{
|
|
|
|
char *name = (void *)(lr + 1); /* name follows lr */
|
|
|
|
size_t namesize = strlen(name) + 1;
|
|
|
|
itx_t *itx;
|
|
|
|
|
|
|
|
if (zil_replaying(zd->zd_zilog, tx))
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
itx = zil_itx_create(TX_CREATE, sizeof (*lr) + namesize);
|
|
|
|
bcopy(&lr->lr_common + 1, &itx->itx_lr + 1,
|
|
|
|
sizeof (*lr) + namesize - sizeof (lr_t));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
zil_itx_assign(zd->zd_zilog, itx, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
static void
|
|
|
|
ztest_log_remove(ztest_ds_t *zd, dmu_tx_t *tx, lr_remove_t *lr, uint64_t object)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
char *name = (void *)(lr + 1); /* name follows lr */
|
|
|
|
size_t namesize = strlen(name) + 1;
|
|
|
|
itx_t *itx;
|
|
|
|
|
|
|
|
if (zil_replaying(zd->zd_zilog, tx))
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
itx = zil_itx_create(TX_REMOVE, sizeof (*lr) + namesize);
|
|
|
|
bcopy(&lr->lr_common + 1, &itx->itx_lr + 1,
|
|
|
|
sizeof (*lr) + namesize - sizeof (lr_t));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
itx->itx_oid = object;
|
|
|
|
zil_itx_assign(zd->zd_zilog, itx, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_log_write(ztest_ds_t *zd, dmu_tx_t *tx, lr_write_t *lr)
|
|
|
|
{
|
|
|
|
itx_t *itx;
|
|
|
|
itx_wr_state_t write_state = ztest_random(WR_NUM_STATES);
|
|
|
|
|
|
|
|
if (zil_replaying(zd->zd_zilog, tx))
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2019-06-10 21:48:42 +03:00
|
|
|
if (lr->lr_length > zil_max_log_data(zd->zd_zilog))
|
2010-05-29 00:45:14 +04:00
|
|
|
write_state = WR_INDIRECT;
|
|
|
|
|
|
|
|
itx = zil_itx_create(TX_WRITE,
|
|
|
|
sizeof (*lr) + (write_state == WR_COPIED ? lr->lr_length : 0));
|
|
|
|
|
|
|
|
if (write_state == WR_COPIED &&
|
|
|
|
dmu_read(zd->zd_os, lr->lr_foid, lr->lr_offset, lr->lr_length,
|
|
|
|
((lr_write_t *)&itx->itx_lr) + 1, DMU_READ_NO_PREFETCH) != 0) {
|
|
|
|
zil_itx_destroy(itx);
|
|
|
|
itx = zil_itx_create(TX_WRITE, sizeof (*lr));
|
|
|
|
write_state = WR_NEED_COPY;
|
|
|
|
}
|
|
|
|
itx->itx_private = zd;
|
|
|
|
itx->itx_wr_state = write_state;
|
|
|
|
itx->itx_sync = (ztest_random(8) == 0);
|
|
|
|
|
|
|
|
bcopy(&lr->lr_common + 1, &itx->itx_lr + 1,
|
|
|
|
sizeof (*lr) - sizeof (lr_t));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
zil_itx_assign(zd->zd_zilog, itx, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_log_truncate(ztest_ds_t *zd, dmu_tx_t *tx, lr_truncate_t *lr)
|
|
|
|
{
|
|
|
|
itx_t *itx;
|
|
|
|
|
|
|
|
if (zil_replaying(zd->zd_zilog, tx))
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
itx = zil_itx_create(TX_TRUNCATE, sizeof (*lr));
|
|
|
|
bcopy(&lr->lr_common + 1, &itx->itx_lr + 1,
|
|
|
|
sizeof (*lr) - sizeof (lr_t));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
itx->itx_sync = B_FALSE;
|
|
|
|
zil_itx_assign(zd->zd_zilog, itx, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_log_setattr(ztest_ds_t *zd, dmu_tx_t *tx, lr_setattr_t *lr)
|
|
|
|
{
|
|
|
|
itx_t *itx;
|
|
|
|
|
|
|
|
if (zil_replaying(zd->zd_zilog, tx))
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
itx = zil_itx_create(TX_SETATTR, sizeof (*lr));
|
|
|
|
bcopy(&lr->lr_common + 1, &itx->itx_lr + 1,
|
|
|
|
sizeof (*lr) - sizeof (lr_t));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
itx->itx_sync = B_FALSE;
|
|
|
|
zil_itx_assign(zd->zd_zilog, itx, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ZIL replay ops
|
|
|
|
*/
|
|
|
|
static int
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_replay_create(void *arg1, void *arg2, boolean_t byteswap)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_ds_t *zd = arg1;
|
|
|
|
lr_create_t *lr = arg2;
|
2010-05-29 00:45:14 +04:00
|
|
|
char *name = (void *)(lr + 1); /* name follows lr */
|
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
ztest_block_tag_t *bbt;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
uint64_t txg;
|
|
|
|
int error = 0;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
int bonuslen;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (byteswap)
|
|
|
|
byteswap_uint64_array(lr, sizeof (*lr));
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(lr->lr_doid, ==, ZTEST_DIROBJ);
|
|
|
|
ASSERT3S(name[0], !=, '\0');
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
dmu_tx_hold_zap(tx, lr->lr_doid, B_TRUE, name);
|
|
|
|
|
|
|
|
if (lr->lrz_type == DMU_OT_ZAP_OTHER) {
|
|
|
|
dmu_tx_hold_zap(tx, DMU_NEW_OBJECT, B_TRUE, NULL);
|
|
|
|
} else {
|
|
|
|
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
|
|
|
|
}
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
if (txg == 0)
|
|
|
|
return (ENOSPC);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(dmu_objset_zil(os)->zl_replay, ==, !!lr->lr_foid);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bonuslen = DN_BONUS_SIZE(lr->lrz_dnodesize);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (lr->lrz_type == DMU_OT_ZAP_OTHER) {
|
|
|
|
if (lr->lr_foid == 0) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
lr->lr_foid = zap_create_dnsize(os,
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lrz_type, lr->lrz_bonustype,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bonuslen, lr->lrz_dnodesize, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
error = zap_create_claim_dnsize(os, lr->lr_foid,
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lrz_type, lr->lrz_bonustype,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bonuslen, lr->lrz_dnodesize, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
if (lr->lr_foid == 0) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
lr->lr_foid = dmu_object_alloc_dnsize(os,
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lrz_type, 0, lr->lrz_bonustype,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bonuslen, lr->lrz_dnodesize, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
error = dmu_object_claim_dnsize(os, lr->lr_foid,
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lrz_type, 0, lr->lrz_bonustype,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bonuslen, lr->lrz_dnodesize, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (error) {
|
|
|
|
ASSERT3U(error, ==, EEXIST);
|
|
|
|
ASSERT(zd->zd_zilog->zl_replay);
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(lr->lr_foid, !=, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (lr->lrz_type != DMU_OT_ZAP_OTHER)
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_object_set_blocksize(os, lr->lr_foid,
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lrz_blocksize, lr->lrz_ibshift, tx));
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_bonus_hold(os, lr->lr_foid, FTAG, &db));
|
2010-05-29 00:45:14 +04:00
|
|
|
bbt = ztest_bt_bonus(db);
|
|
|
|
dmu_buf_will_dirty(db, tx);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_generate(bbt, os, lr->lr_foid, lr->lrz_dnodesize, -1ULL,
|
|
|
|
lr->lr_gen, txg, txg);
|
|
|
|
ztest_fill_unused_bonus(db, bbt, lr->lr_foid, os, lr->lr_gen);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_add(os, lr->lr_doid, name, sizeof (uint64_t), 1,
|
2010-05-29 00:45:14 +04:00
|
|
|
&lr->lr_foid, tx));
|
|
|
|
|
|
|
|
(void) ztest_log_create(zd, tx, lr);
|
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_replay_remove(void *arg1, void *arg2, boolean_t byteswap)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_ds_t *zd = arg1;
|
|
|
|
lr_remove_t *lr = arg2;
|
2010-05-29 00:45:14 +04:00
|
|
|
char *name = (void *)(lr + 1); /* name follows lr */
|
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
uint64_t object, txg;
|
|
|
|
|
|
|
|
if (byteswap)
|
|
|
|
byteswap_uint64_array(lr, sizeof (*lr));
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(lr->lr_doid, ==, ZTEST_DIROBJ);
|
|
|
|
ASSERT3S(name[0], !=, '\0');
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(
|
2010-05-29 00:45:14 +04:00
|
|
|
zap_lookup(os, lr->lr_doid, name, sizeof (object), 1, &object));
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(object, !=, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
ztest_object_lock(zd, object, RL_WRITER);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_object_info(os, object, &doi));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
dmu_tx_hold_zap(tx, lr->lr_doid, B_FALSE, name);
|
|
|
|
dmu_tx_hold_free(tx, object, 0, DMU_OBJECT_END);
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
return (ENOSPC);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (doi.doi_type == DMU_OT_ZAP_OTHER) {
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_destroy(os, object, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_object_free(os, object, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_remove(os, lr->lr_doid, name, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
(void) ztest_log_remove(zd, tx, lr, object);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_replay_write(void *arg1, void *arg2, boolean_t byteswap)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_ds_t *zd = arg1;
|
|
|
|
lr_write_t *lr = arg2;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
void *data = lr + 1; /* data follows lr */
|
|
|
|
uint64_t offset, length;
|
|
|
|
ztest_block_tag_t *bt = data;
|
|
|
|
ztest_block_tag_t *bbt;
|
|
|
|
uint64_t gen, txg, lrtxg, crtxg;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
arc_buf_t *abuf = NULL;
|
2018-10-02 01:13:12 +03:00
|
|
|
rl_t *rl;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (byteswap)
|
|
|
|
byteswap_uint64_array(lr, sizeof (*lr));
|
|
|
|
|
|
|
|
offset = lr->lr_offset;
|
|
|
|
length = lr->lr_length;
|
|
|
|
|
|
|
|
/* If it's a dmu_sync() block, write the whole block */
|
|
|
|
if (lr->lr_common.lrc_reclen == sizeof (lr_write_t)) {
|
|
|
|
uint64_t blocksize = BP_GET_LSIZE(&lr->lr_blkptr);
|
|
|
|
if (length < blocksize) {
|
|
|
|
offset -= offset % blocksize;
|
|
|
|
length = blocksize;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bt->bt_magic == BSWAP_64(BT_MAGIC))
|
|
|
|
byteswap_uint64_array(bt, sizeof (*bt));
|
|
|
|
|
|
|
|
if (bt->bt_magic != BT_MAGIC)
|
|
|
|
bt = NULL;
|
|
|
|
|
|
|
|
ztest_object_lock(zd, lr->lr_foid, RL_READER);
|
|
|
|
rl = ztest_range_lock(zd, lr->lr_foid, offset, length, RL_WRITER);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_bonus_hold(os, lr->lr_foid, FTAG, &db));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
|
|
|
|
bbt = ztest_bt_bonus(db);
|
|
|
|
ASSERT3U(bbt->bt_magic, ==, BT_MAGIC);
|
|
|
|
gen = bbt->bt_gen;
|
|
|
|
crtxg = bbt->bt_crtxg;
|
|
|
|
lrtxg = lr->lr_common.lrc_txg;
|
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
dmu_tx_hold_write(tx, lr->lr_foid, offset, length);
|
|
|
|
|
|
|
|
if (ztest_random(8) == 0 && length == doi.doi_data_block_size &&
|
|
|
|
P2PHASE(offset, length) == 0)
|
|
|
|
abuf = dmu_request_arcbuf(db, length);
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
|
|
|
if (abuf != NULL)
|
|
|
|
dmu_return_arcbuf(abuf);
|
|
|
|
dmu_buf_rele(db, FTAG);
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
return (ENOSPC);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bt != NULL) {
|
|
|
|
/*
|
|
|
|
* Usually, verify the old data before writing new data --
|
|
|
|
* but not always, because we also want to verify correct
|
|
|
|
* behavior when the data was not recently read into cache.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(offset % doi.doi_data_block_size);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (ztest_random(4) != 0) {
|
|
|
|
int prefetch = ztest_random(2) ?
|
|
|
|
DMU_READ_PREFETCH : DMU_READ_NO_PREFETCH;
|
|
|
|
ztest_block_tag_t rbt;
|
|
|
|
|
|
|
|
VERIFY(dmu_read(os, lr->lr_foid, offset,
|
|
|
|
sizeof (rbt), &rbt, prefetch) == 0);
|
|
|
|
if (rbt.bt_magic == BT_MAGIC) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_verify(&rbt, os, lr->lr_foid, 0,
|
2010-05-29 00:45:14 +04:00
|
|
|
offset, gen, txg, crtxg);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Writes can appear to be newer than the bonus buffer because
|
|
|
|
* the ztest_get_data() callback does a dmu_read() of the
|
|
|
|
* open-context data, which may be different than the data
|
|
|
|
* as it was when the write was generated.
|
|
|
|
*/
|
|
|
|
if (zd->zd_zilog->zl_replay) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_verify(bt, os, lr->lr_foid, 0, offset,
|
2010-05-29 00:45:14 +04:00
|
|
|
MAX(gen, bt->bt_gen), MAX(txg, lrtxg),
|
|
|
|
bt->bt_crtxg);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set the bt's gen/txg to the bonus buffer's gen/txg
|
|
|
|
* so that all of the usual ASSERTs will work.
|
|
|
|
*/
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_generate(bt, os, lr->lr_foid, 0, offset, gen, txg,
|
|
|
|
crtxg);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (abuf == NULL) {
|
|
|
|
dmu_write(os, lr->lr_foid, offset, length, data, tx);
|
|
|
|
} else {
|
|
|
|
bcopy(data, abuf->b_data, length);
|
2017-09-28 18:49:13 +03:00
|
|
|
dmu_assign_arcbuf_by_dbuf(db, offset, abuf, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
(void) ztest_log_write(zd, tx, lr);
|
|
|
|
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_replay_truncate(void *arg1, void *arg2, boolean_t byteswap)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_ds_t *zd = arg1;
|
|
|
|
lr_truncate_t *lr = arg2;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
uint64_t txg;
|
2018-10-02 01:13:12 +03:00
|
|
|
rl_t *rl;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (byteswap)
|
|
|
|
byteswap_uint64_array(lr, sizeof (*lr));
|
|
|
|
|
|
|
|
ztest_object_lock(zd, lr->lr_foid, RL_READER);
|
|
|
|
rl = ztest_range_lock(zd, lr->lr_foid, lr->lr_offset, lr->lr_length,
|
|
|
|
RL_WRITER);
|
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
dmu_tx_hold_free(tx, lr->lr_foid, lr->lr_offset, lr->lr_length);
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
return (ENOSPC);
|
|
|
|
}
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_free_range(os, lr->lr_foid, lr->lr_offset,
|
|
|
|
lr->lr_length, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
(void) ztest_log_truncate(zd, tx, lr);
|
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_replay_setattr(void *arg1, void *arg2, boolean_t byteswap)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_ds_t *zd = arg1;
|
|
|
|
lr_setattr_t *lr = arg2;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
ztest_block_tag_t *bbt;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t txg, lrtxg, crtxg, dnodesize;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (byteswap)
|
|
|
|
byteswap_uint64_array(lr, sizeof (*lr));
|
|
|
|
|
|
|
|
ztest_object_lock(zd, lr->lr_foid, RL_WRITER);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_bonus_hold(os, lr->lr_foid, FTAG, &db));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
dmu_tx_hold_bonus(tx, lr->lr_foid);
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
return (ENOSPC);
|
|
|
|
}
|
|
|
|
|
|
|
|
bbt = ztest_bt_bonus(db);
|
|
|
|
ASSERT3U(bbt->bt_magic, ==, BT_MAGIC);
|
|
|
|
crtxg = bbt->bt_crtxg;
|
|
|
|
lrtxg = lr->lr_common.lrc_txg;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
dnodesize = bbt->bt_dnodesize;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (zd->zd_zilog->zl_replay) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(lr->lr_size, !=, 0);
|
|
|
|
ASSERT3U(lr->lr_mode, !=, 0);
|
|
|
|
ASSERT3U(lrtxg, !=, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Randomly change the size and increment the generation.
|
|
|
|
*/
|
|
|
|
lr->lr_size = (ztest_random(db->db_size / sizeof (*bbt)) + 1) *
|
|
|
|
sizeof (*bbt);
|
|
|
|
lr->lr_mode = bbt->bt_gen + 1;
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(lrtxg);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that the current bonus buffer is not newer than our txg.
|
|
|
|
*/
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_verify(bbt, os, lr->lr_foid, dnodesize, -1ULL, lr->lr_mode,
|
2010-05-29 00:45:14 +04:00
|
|
|
MAX(txg, lrtxg), crtxg);
|
|
|
|
|
|
|
|
dmu_buf_will_dirty(db, tx);
|
|
|
|
|
|
|
|
ASSERT3U(lr->lr_size, >=, sizeof (*bbt));
|
|
|
|
ASSERT3U(lr->lr_size, <=, db->db_size);
|
2013-05-11 01:17:03 +04:00
|
|
|
VERIFY0(dmu_set_bonus(db, lr->lr_size, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
bbt = ztest_bt_bonus(db);
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_generate(bbt, os, lr->lr_foid, dnodesize, -1ULL, lr->lr_mode,
|
|
|
|
txg, crtxg);
|
|
|
|
ztest_fill_unused_bonus(db, bbt, lr->lr_foid, os, bbt->bt_gen);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
|
|
|
(void) ztest_log_setattr(zd, tx, lr);
|
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
zil_replay_func_t *ztest_replay_vector[TX_MAX_TYPE] = {
|
|
|
|
NULL, /* 0 no such transaction type */
|
|
|
|
ztest_replay_create, /* TX_CREATE */
|
|
|
|
NULL, /* TX_MKDIR */
|
|
|
|
NULL, /* TX_MKXATTR */
|
|
|
|
NULL, /* TX_SYMLINK */
|
|
|
|
ztest_replay_remove, /* TX_REMOVE */
|
|
|
|
NULL, /* TX_RMDIR */
|
|
|
|
NULL, /* TX_LINK */
|
|
|
|
NULL, /* TX_RENAME */
|
|
|
|
ztest_replay_write, /* TX_WRITE */
|
|
|
|
ztest_replay_truncate, /* TX_TRUNCATE */
|
|
|
|
ztest_replay_setattr, /* TX_SETATTR */
|
|
|
|
NULL, /* TX_ACL */
|
|
|
|
NULL, /* TX_CREATE_ACL */
|
|
|
|
NULL, /* TX_CREATE_ATTR */
|
|
|
|
NULL, /* TX_CREATE_ACL_ATTR */
|
|
|
|
NULL, /* TX_MKDIR_ACL */
|
|
|
|
NULL, /* TX_MKDIR_ATTR */
|
|
|
|
NULL, /* TX_MKDIR_ACL_ATTR */
|
|
|
|
NULL, /* TX_WRITE2 */
|
2010-05-29 00:45:14 +04:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ZIL get_data callbacks
|
|
|
|
*/
|
|
|
|
|
OpenZFS 9962 - zil_commit should omit cache thrash
As a result of the changes made in 8585, it's possible for an excessive
amount of vdev flush commands to be issued under some workloads.
Specifically, when the workload consists of mostly async write activity,
interspersed with some sync write and/or fsync activity, we can end up
issuing more flush commands to the underlying storage than is actually
necessary. As a result of these flush commands, the write latency and
overall throughput of the pool can be poorly impacted (latency
increases, throughput decreases).
Currently, any time an lwb completes, the vdev(s) written to as a result
of that lwb will be issued a flush command. The intenion is so the data
written to that vdev is on stable storage, prior to communicating to any
waiting threads that their data is safe on disk.
The problem with this scheme, is that sometimes an lwb will not have any
threads waiting for it to complete. This can occur when there's async
activity that gets "converted" to sync requests, as a result of calling
the zil_async_to_sync() function via zil_commit_impl(). When this
occurs, the current code may issue many lwbs that don't have waiters
associated with them, resulting in many flush commands, potentially to
the same vdev(s).
For example, given a pool with a single vdev, and a single fsync() call
that results in 10 lwbs being written out (e.g. due to other async
writes), that will result in 10 flush commands to that single vdev (a
flush issued after each lwb write completes). Ideally, we'd only issue a
single flush command to that vdev, after all 10 lwb writes completed.
Further, and most important as it pertains to this change, since the
flush commands are often very impactful to the performance of the pool's
underlying storage, unnecessarily issuing these flush commands can
poorly impact the performance of the lwb writes themselves. Thus, we
need to avoid issuing flush commands when possible, in order to acheive
the best possible performance out of the pool's underlying storage.
This change attempts to address this problem by changing the ZIL's logic
to only issue a vdev flush command when it detects an lwb that has a
thread waiting for it to complete. When an lwb does not have threads
waiting for it, the responsibility of issuing the flush command to the
vdevs involved with that lwb's write is passed on to the "next" lwb.
It's only once a write for an lwb with waiters completes, do we issue
the vdev flush command(s). As a result, now when we issue the flush(s),
we will issue them to the vdevs involved with that specific lwb's write,
but potentially also to vdevs involved with "previous" lwb writes (i.e.
if the previous lwbs did not have waiters associated with them).
Thus, in our prior example with 10 lwbs, it's only once the last lwb
completes (which will be the lwb containing the waiter for the thread
that called fsync) will we issue the vdev flush command; all of the
other lwbs will find they have no waiters, so they'll pass the
responsibility of the flush to the "next" lwb (until reaching the last
lwb that has the waiter).
Porting Notes:
* Reconciled conflicts with the fastwrite feature.
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6
Closes #8188
2018-10-24 00:14:27 +03:00
|
|
|
/* ARGSUSED */
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_get_done(zgd_t *zgd, int error)
|
|
|
|
{
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_ds_t *zd = zgd->zgd_private;
|
|
|
|
uint64_t object = ((rl_t *)zgd->zgd_lr)->rl_object;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (zgd->zgd_db)
|
|
|
|
dmu_buf_rele(zgd->zgd_db, zgd);
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock((rl_t *)zgd->zgd_lr);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
|
|
|
|
umem_free(zgd, sizeof (*zgd));
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2021-03-20 08:53:31 +03:00
|
|
|
ztest_get_data(void *arg, uint64_t arg2, lr_write_t *lr, char *buf,
|
|
|
|
struct lwb *lwb, zio_t *zio)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
ztest_ds_t *zd = arg;
|
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
uint64_t object = lr->lr_foid;
|
|
|
|
uint64_t offset = lr->lr_offset;
|
|
|
|
uint64_t size = lr->lr_length;
|
|
|
|
uint64_t txg = lr->lr_common.lrc_txg;
|
|
|
|
uint64_t crtxg;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
zgd_t *zgd;
|
|
|
|
int error;
|
|
|
|
|
OpenZFS 8585 - improve batching done in zil_commit()
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
Problem
=======
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
To further complicate the situation, these two issues result in the
following side effect:
3. If a given batch takes longer to complete than normal, this results
in larger batch sizes, which then take longer to complete and
further drive up the latency of zil_commit(). This can occur for a
number of reasons, including (but not limited to): transient changes
in the workload, and storage latency irregularites.
Solution
========
The solution attempted by this change has the following goals:
1. no on-disk changes; maintain current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk, while there's
already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible, for the same amount of ZIL
transactions, without introducing significant latency to any
individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.
In theory, with these goals met, the new allgorithm will allow the
following improvements:
1. new ZIL blocks can be generated and issued, while there's already
oustanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying
storage latency, rather than the incoming synchronous workload.
Porting Notes
=============
Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs than in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).
To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:
* A list of itxs was added to the lwb structure. This list contains
all of the itxs that have been committed to the lwb, such that the
callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
after the data for the itxs is committed to disk.
* A list of itxs was added on the stack of the zil_process_commit_list()
function; the "nolwb_itxs" list. In some circumstances, an itx may
not be committed to an lwb (e.g. if allocating the "next" ZIL block
on disk fails), so this list is used to keep track of which itxs
fall into this state, such that their callbacks can be called after
the ZIL's writer pipeline is "stalled".
* The logic to actually call the itx's callback was moved into the
zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
were effectively performing the same logic (i.e. if callback is
non-null, call the callback), it seemed like useful code cleanup to
consolidate this logic into a single function.
Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:
* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
removed from "trace_zil.h" as well.
* Some of the zilog structure's fields were removed, which affected
the tracepoint definitions of the structure.
* New tracepoints had to be added for the following 3 new probes:
* zil__process__commit__itx
* zil__process__normal__itx
* zil__commit__io__error
OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566
2017-12-05 20:39:16 +03:00
|
|
|
ASSERT3P(lwb, !=, NULL);
|
|
|
|
ASSERT3P(zio, !=, NULL);
|
|
|
|
ASSERT3U(size, !=, 0);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_lock(zd, object, RL_READER);
|
|
|
|
error = dmu_bonus_hold(os, object, FTAG, &db);
|
|
|
|
if (error) {
|
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
crtxg = ztest_bt_bonus(db)->bt_crtxg;
|
|
|
|
|
|
|
|
if (crtxg == 0 || crtxg > txg) {
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
return (ENOENT);
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
db = NULL;
|
|
|
|
|
|
|
|
zgd = umem_zalloc(sizeof (*zgd), UMEM_NOFAIL);
|
OpenZFS 8585 - improve batching done in zil_commit()
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
Problem
=======
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
To further complicate the situation, these two issues result in the
following side effect:
3. If a given batch takes longer to complete than normal, this results
in larger batch sizes, which then take longer to complete and
further drive up the latency of zil_commit(). This can occur for a
number of reasons, including (but not limited to): transient changes
in the workload, and storage latency irregularites.
Solution
========
The solution attempted by this change has the following goals:
1. no on-disk changes; maintain current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk, while there's
already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible, for the same amount of ZIL
transactions, without introducing significant latency to any
individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.
In theory, with these goals met, the new allgorithm will allow the
following improvements:
1. new ZIL blocks can be generated and issued, while there's already
oustanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying
storage latency, rather than the incoming synchronous workload.
Porting Notes
=============
Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs than in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).
To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:
* A list of itxs was added to the lwb structure. This list contains
all of the itxs that have been committed to the lwb, such that the
callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
after the data for the itxs is committed to disk.
* A list of itxs was added on the stack of the zil_process_commit_list()
function; the "nolwb_itxs" list. In some circumstances, an itx may
not be committed to an lwb (e.g. if allocating the "next" ZIL block
on disk fails), so this list is used to keep track of which itxs
fall into this state, such that their callbacks can be called after
the ZIL's writer pipeline is "stalled".
* The logic to actually call the itx's callback was moved into the
zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
were effectively performing the same logic (i.e. if callback is
non-null, call the callback), it seemed like useful code cleanup to
consolidate this logic into a single function.
Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:
* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
removed from "trace_zil.h" as well.
* Some of the zilog structure's fields were removed, which affected
the tracepoint definitions of the structure.
* New tracepoints had to be added for the following 3 new probes:
* zil__process__commit__itx
* zil__process__normal__itx
* zil__commit__io__error
OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566
2017-12-05 20:39:16 +03:00
|
|
|
zgd->zgd_lwb = lwb;
|
2018-10-02 01:13:12 +03:00
|
|
|
zgd->zgd_private = zd;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (buf != NULL) { /* immediate write */
|
2019-11-01 20:37:33 +03:00
|
|
|
zgd->zgd_lr = (struct zfs_locked_range *)ztest_range_lock(zd,
|
2018-10-02 01:13:12 +03:00
|
|
|
object, offset, size, RL_READER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
error = dmu_read(os, object, offset, size, buf,
|
|
|
|
DMU_READ_NO_PREFETCH);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
|
|
|
size = doi.doi_data_block_size;
|
|
|
|
if (ISP2(size)) {
|
|
|
|
offset = P2ALIGN(offset, size);
|
|
|
|
} else {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(offset, <, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
offset = 0;
|
|
|
|
}
|
|
|
|
|
2019-11-01 20:37:33 +03:00
|
|
|
zgd->zgd_lr = (struct zfs_locked_range *)ztest_range_lock(zd,
|
2018-10-02 01:13:12 +03:00
|
|
|
object, offset, size, RL_READER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
error = dmu_buf_hold(os, object, offset, zgd, &db,
|
|
|
|
DMU_READ_NO_PREFETCH);
|
|
|
|
|
|
|
|
if (error == 0) {
|
2017-04-14 22:59:18 +03:00
|
|
|
blkptr_t *bp = &lr->lr_blkptr;
|
2013-05-10 23:47:54 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
zgd->zgd_db = db;
|
|
|
|
zgd->zgd_bp = bp;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(db->db_offset, ==, offset);
|
|
|
|
ASSERT3U(db->db_size, ==, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
error = dmu_sync(zio, lr->lr_common.lrc_txg,
|
|
|
|
ztest_get_done, zgd);
|
|
|
|
|
|
|
|
if (error == 0)
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
ztest_get_done(zgd, error);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *
|
|
|
|
ztest_lr_alloc(size_t lrsize, char *name)
|
|
|
|
{
|
|
|
|
char *lr;
|
|
|
|
size_t namesize = name ? strlen(name) + 1 : 0;
|
|
|
|
|
|
|
|
lr = umem_zalloc(lrsize + namesize, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
if (name)
|
|
|
|
bcopy(name, lr + lrsize, namesize);
|
|
|
|
|
|
|
|
return (lr);
|
|
|
|
}
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_lr_free(void *lr, size_t lrsize, char *name)
|
|
|
|
{
|
|
|
|
size_t namesize = name ? strlen(name) + 1 : 0;
|
|
|
|
|
|
|
|
umem_free(lr, lrsize + namesize);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Lookup a bunch of objects. Returns the number of objects not found.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
ztest_lookup(ztest_ds_t *zd, ztest_od_t *od, int count)
|
|
|
|
{
|
|
|
|
int missing = 0;
|
|
|
|
int error;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
ASSERT(MUTEX_HELD(&zd->zd_dirobj_lock));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < count; i++, od++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
od->od_object = 0;
|
|
|
|
error = zap_lookup(zd->zd_os, od->od_dir, od->od_name,
|
|
|
|
sizeof (uint64_t), 1, &od->od_object);
|
|
|
|
if (error) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(error, ==, ENOENT);
|
|
|
|
ASSERT0(od->od_object);
|
2010-05-29 00:45:14 +04:00
|
|
|
missing++;
|
|
|
|
} else {
|
|
|
|
dmu_buf_t *db;
|
|
|
|
ztest_block_tag_t *bbt;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(od->od_object, !=, 0);
|
|
|
|
ASSERT0(missing); /* there should be no gaps */
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
ztest_object_lock(zd, od->od_object, RL_READER);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_bonus_hold(zd->zd_os, od->od_object,
|
|
|
|
FTAG, &db));
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
bbt = ztest_bt_bonus(db);
|
|
|
|
ASSERT3U(bbt->bt_magic, ==, BT_MAGIC);
|
|
|
|
od->od_type = doi.doi_type;
|
|
|
|
od->od_blocksize = doi.doi_data_block_size;
|
|
|
|
od->od_gen = bbt->bt_gen;
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
ztest_object_unlock(zd, od->od_object);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return (missing);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
ztest_create(ztest_ds_t *zd, ztest_od_t *od, int count)
|
|
|
|
{
|
|
|
|
int missing = 0;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
ASSERT(MUTEX_HELD(&zd->zd_dirobj_lock));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < count; i++, od++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (missing) {
|
|
|
|
od->od_object = 0;
|
|
|
|
missing++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
lr_create_t *lr = ztest_lr_alloc(sizeof (*lr), od->od_name);
|
|
|
|
|
|
|
|
lr->lr_doid = od->od_dir;
|
|
|
|
lr->lr_foid = 0; /* 0 to allocate, > 0 to claim */
|
|
|
|
lr->lrz_type = od->od_crtype;
|
|
|
|
lr->lrz_blocksize = od->od_crblocksize;
|
|
|
|
lr->lrz_ibshift = ztest_random_ibshift();
|
|
|
|
lr->lrz_bonustype = DMU_OT_UINT64_OTHER;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
lr->lrz_dnodesize = od->od_crdnodesize;
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lr_gen = od->od_crgen;
|
|
|
|
lr->lr_crtime[0] = time(NULL);
|
|
|
|
|
|
|
|
if (ztest_replay_create(zd, lr, B_FALSE) != 0) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(missing);
|
2010-05-29 00:45:14 +04:00
|
|
|
od->od_object = 0;
|
|
|
|
missing++;
|
|
|
|
} else {
|
|
|
|
od->od_object = lr->lr_foid;
|
|
|
|
od->od_type = od->od_crtype;
|
|
|
|
od->od_blocksize = od->od_crblocksize;
|
|
|
|
od->od_gen = od->od_crgen;
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(od->od_object, !=, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
ztest_lr_free(lr, sizeof (*lr), od->od_name);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (missing);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
ztest_remove(ztest_ds_t *zd, ztest_od_t *od, int count)
|
|
|
|
{
|
|
|
|
int missing = 0;
|
|
|
|
int error;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
ASSERT(MUTEX_HELD(&zd->zd_dirobj_lock));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
od += count - 1;
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = count - 1; i >= 0; i--, od--) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (missing) {
|
|
|
|
missing++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2013-05-10 23:47:54 +04:00
|
|
|
/*
|
|
|
|
* No object was found.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (od->od_object == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
lr_remove_t *lr = ztest_lr_alloc(sizeof (*lr), od->od_name);
|
|
|
|
|
|
|
|
lr->lr_doid = od->od_dir;
|
|
|
|
|
|
|
|
if ((error = ztest_replay_remove(zd, lr, B_FALSE)) != 0) {
|
|
|
|
ASSERT3U(error, ==, ENOSPC);
|
|
|
|
missing++;
|
|
|
|
} else {
|
|
|
|
od->od_object = 0;
|
|
|
|
}
|
|
|
|
ztest_lr_free(lr, sizeof (*lr), od->od_name);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (missing);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
ztest_write(ztest_ds_t *zd, uint64_t object, uint64_t offset, uint64_t size,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
lr_write_t *lr;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
lr = ztest_lr_alloc(sizeof (*lr) + size, NULL);
|
|
|
|
|
|
|
|
lr->lr_foid = object;
|
|
|
|
lr->lr_offset = offset;
|
|
|
|
lr->lr_length = size;
|
|
|
|
lr->lr_blkoff = 0;
|
|
|
|
BP_ZERO(&lr->lr_blkptr);
|
|
|
|
|
|
|
|
bcopy(data, lr + 1, size);
|
|
|
|
|
|
|
|
error = ztest_replay_write(zd, lr, B_FALSE);
|
|
|
|
|
|
|
|
ztest_lr_free(lr, sizeof (*lr) + size, NULL);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
ztest_truncate(ztest_ds_t *zd, uint64_t object, uint64_t offset, uint64_t size)
|
|
|
|
{
|
|
|
|
lr_truncate_t *lr;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
lr = ztest_lr_alloc(sizeof (*lr), NULL);
|
|
|
|
|
|
|
|
lr->lr_foid = object;
|
|
|
|
lr->lr_offset = offset;
|
|
|
|
lr->lr_length = size;
|
|
|
|
|
|
|
|
error = ztest_replay_truncate(zd, lr, B_FALSE);
|
|
|
|
|
|
|
|
ztest_lr_free(lr, sizeof (*lr), NULL);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
ztest_setattr(ztest_ds_t *zd, uint64_t object)
|
|
|
|
{
|
|
|
|
lr_setattr_t *lr;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
lr = ztest_lr_alloc(sizeof (*lr), NULL);
|
|
|
|
|
|
|
|
lr->lr_foid = object;
|
|
|
|
lr->lr_size = 0;
|
|
|
|
lr->lr_mode = 0;
|
|
|
|
|
|
|
|
error = ztest_replay_setattr(zd, lr, B_FALSE);
|
|
|
|
|
|
|
|
ztest_lr_free(lr, sizeof (*lr), NULL);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_prealloc(ztest_ds_t *zd, uint64_t object, uint64_t offset, uint64_t size)
|
|
|
|
{
|
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
uint64_t txg;
|
2018-10-02 01:13:12 +03:00
|
|
|
rl_t *rl;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
txg_wait_synced(dmu_objset_pool(os), 0);
|
|
|
|
|
|
|
|
ztest_object_lock(zd, object, RL_READER);
|
|
|
|
rl = ztest_range_lock(zd, object, offset, size, RL_WRITER);
|
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
dmu_tx_hold_write(tx, object, offset, size);
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
|
|
|
|
if (txg != 0) {
|
|
|
|
dmu_prealloc(os, object, offset, size, tx);
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
txg_wait_synced(dmu_objset_pool(os), txg);
|
|
|
|
} else {
|
|
|
|
(void) dmu_free_long_range(os, object, offset, size);
|
|
|
|
}
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_io(ztest_ds_t *zd, uint64_t object, uint64_t offset)
|
|
|
|
{
|
2013-05-10 23:47:54 +04:00
|
|
|
int err;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_block_tag_t wbt;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
enum ztest_io_type io_type;
|
|
|
|
uint64_t blocksize;
|
|
|
|
void *data;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_object_info(zd->zd_os, object, &doi));
|
2010-05-29 00:45:14 +04:00
|
|
|
blocksize = doi.doi_data_block_size;
|
|
|
|
data = umem_alloc(blocksize, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick an i/o type at random, biased toward writing block tags.
|
|
|
|
*/
|
|
|
|
io_type = ztest_random(ZTEST_IO_TYPES);
|
|
|
|
if (ztest_random(2) == 0)
|
|
|
|
io_type = ZTEST_IO_WRITE_TAG;
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&zd->zd_zilog_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
switch (io_type) {
|
|
|
|
|
|
|
|
case ZTEST_IO_WRITE_TAG:
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_generate(&wbt, zd->zd_os, object, doi.doi_dnodesize,
|
|
|
|
offset, 0, 0, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) ztest_write(zd, object, offset, sizeof (wbt), &wbt);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ZTEST_IO_WRITE_PATTERN:
|
|
|
|
(void) memset(data, 'a' + (object + offset) % 5, blocksize);
|
|
|
|
if (ztest_random(2) == 0) {
|
|
|
|
/*
|
|
|
|
* Induce fletcher2 collisions to ensure that
|
|
|
|
* zio_ddt_collision() detects and resolves them
|
|
|
|
* when using fletcher2-verify for deduplication.
|
|
|
|
*/
|
|
|
|
((uint64_t *)data)[0] ^= 1ULL << 63;
|
|
|
|
((uint64_t *)data)[4] ^= 1ULL << 63;
|
|
|
|
}
|
|
|
|
(void) ztest_write(zd, object, offset, blocksize, data);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ZTEST_IO_WRITE_ZEROES:
|
|
|
|
bzero(data, blocksize);
|
|
|
|
(void) ztest_write(zd, object, offset, blocksize, data);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ZTEST_IO_TRUNCATE:
|
|
|
|
(void) ztest_truncate(zd, object, offset, blocksize);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ZTEST_IO_SETATTR:
|
|
|
|
(void) ztest_setattr(zd, object);
|
|
|
|
break;
|
2010-08-26 20:52:41 +04:00
|
|
|
default:
|
|
|
|
break;
|
2013-05-10 23:47:54 +04:00
|
|
|
|
|
|
|
case ZTEST_IO_REWRITE:
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2013-05-10 23:47:54 +04:00
|
|
|
err = ztest_dsl_prop_set_uint64(zd->zd_name,
|
|
|
|
ZFS_PROP_CHECKSUM, spa_dedup_checksum(ztest_spa),
|
|
|
|
B_FALSE);
|
|
|
|
VERIFY(err == 0 || err == ENOSPC);
|
|
|
|
err = ztest_dsl_prop_set_uint64(zd->zd_name,
|
|
|
|
ZFS_PROP_COMPRESSION,
|
|
|
|
ztest_random_dsl_prop(ZFS_PROP_COMPRESSION),
|
|
|
|
B_FALSE);
|
|
|
|
VERIFY(err == 0 || err == ENOSPC);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2013-05-10 23:47:54 +04:00
|
|
|
|
|
|
|
VERIFY0(dmu_read(zd->zd_os, object, offset, blocksize, data,
|
|
|
|
DMU_READ_NO_PREFETCH));
|
|
|
|
|
|
|
|
(void) ztest_write(zd, object, offset, blocksize, data);
|
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&zd->zd_zilog_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
umem_free(data, blocksize);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize an object description template.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ztest_od_init(ztest_od_t *od, uint64_t id, char *tag, uint64_t index,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
dmu_object_type_t type, uint64_t blocksize, uint64_t dnodesize,
|
|
|
|
uint64_t gen)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
od->od_dir = ZTEST_DIROBJ;
|
|
|
|
od->od_object = 0;
|
|
|
|
|
|
|
|
od->od_crtype = type;
|
|
|
|
od->od_crblocksize = blocksize ? blocksize : ztest_random_blocksize();
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
od->od_crdnodesize = dnodesize ? dnodesize : ztest_random_dnodesize();
|
2010-05-29 00:45:14 +04:00
|
|
|
od->od_crgen = gen;
|
|
|
|
|
|
|
|
od->od_type = DMU_OT_NONE;
|
|
|
|
od->od_blocksize = 0;
|
|
|
|
od->od_gen = 0;
|
|
|
|
|
|
|
|
(void) snprintf(od->od_name, sizeof (od->od_name), "%s(%lld)[%llu]",
|
2010-08-26 20:52:39 +04:00
|
|
|
tag, (longlong_t)id, (u_longlong_t)index);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Lookup or create the objects for a test using the od template.
|
|
|
|
* If the objects do not all exist, or if 'remove' is specified,
|
|
|
|
* remove any existing objects and create new ones. Otherwise,
|
|
|
|
* use the existing objects.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
ztest_object_init(ztest_ds_t *zd, ztest_od_t *od, size_t size, boolean_t remove)
|
|
|
|
{
|
|
|
|
int count = size / sizeof (*od);
|
|
|
|
int rv = 0;
|
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_enter(&zd->zd_dirobj_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((ztest_lookup(zd, od, count) != 0 || remove) &&
|
|
|
|
(ztest_remove(zd, od, count) != 0 ||
|
|
|
|
ztest_create(zd, od, count) != 0))
|
|
|
|
rv = -1;
|
|
|
|
zd->zd_od = od;
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_exit(&zd->zd_dirobj_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
return (rv);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_zil_commit(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
zilog_t *zilog = zd->zd_zilog;
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&zd->zd_zilog_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
zil_commit(zilog, ztest_random(ZTEST_OBJECTS));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Remember the committed values in zd, which is in parent/child
|
|
|
|
* shared memory. If we die, the next iteration of ztest_run()
|
|
|
|
* will verify that the log really does contain this record.
|
|
|
|
*/
|
|
|
|
mutex_enter(&zilog->zl_lock);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(zd->zd_shared, !=, NULL);
|
2012-01-24 06:43:32 +04:00
|
|
|
ASSERT3U(zd->zd_shared->zd_seq, <=, zilog->zl_commit_lr_seq);
|
|
|
|
zd->zd_shared->zd_seq = zilog->zl_commit_lr_seq;
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&zilog->zl_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&zd->zd_zilog_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function is designed to simulate the operations that occur during a
|
|
|
|
* mount/unmount operation. We hold the dataset across these operations in an
|
|
|
|
* attempt to expose any implicit assumptions about ZIL management.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_zil_remount(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
|
2018-12-04 20:43:31 +03:00
|
|
|
/*
|
|
|
|
* We hold the ztest_vdev_lock so we don't cause problems with
|
|
|
|
* other threads that wish to remove a log device, such as
|
|
|
|
* ztest_device_removal().
|
|
|
|
*/
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
2013-05-10 23:47:54 +04:00
|
|
|
/*
|
|
|
|
* We grab the zd_dirobj_lock to ensure that no other thread is
|
|
|
|
* updating the zil (i.e. adding in-memory log records) and the
|
|
|
|
* zd_zilog_lock to block any I/O.
|
|
|
|
*/
|
2012-12-15 04:13:40 +04:00
|
|
|
mutex_enter(&zd->zd_dirobj_lock);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_wrlock(&zd->zd_zilog_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
|
2017-03-09 01:56:19 +03:00
|
|
|
/* zfsvfs_teardown() */
|
2011-07-26 23:41:53 +04:00
|
|
|
zil_close(zd->zd_zilog);
|
|
|
|
|
|
|
|
/* zfsvfs_setup() */
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3P(zil_open(os, ztest_get_data), ==, zd->zd_zilog);
|
2011-07-26 23:41:53 +04:00
|
|
|
zil_replay(os, zd, ztest_replay_vector);
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&zd->zd_zilog_lock);
|
2012-12-15 04:13:40 +04:00
|
|
|
mutex_exit(&zd->zd_dirobj_lock);
|
2018-12-04 20:43:31 +03:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can't destroy an active pool, create an existing pool,
|
|
|
|
* or create a pool with a bad vdev spec.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_spa_create_destroy(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_opts_t *zo = &ztest_opts;
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_t *spa;
|
|
|
|
nvlist_t *nvroot;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (zo->zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Attempt to create using a bad file.
|
|
|
|
*/
|
2018-09-06 04:33:36 +03:00
|
|
|
nvroot = make_vdev_root("/dev/bogus", NULL, NULL, 0, 0, NULL, 0, 0, 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(ENOENT, ==,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
spa_create("ztest_bad_file", nvroot, NULL, NULL, NULL));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Attempt to create using a bad mirror.
|
|
|
|
*/
|
2018-09-06 04:33:36 +03:00
|
|
|
nvroot = make_vdev_root("/dev/bogus", NULL, NULL, 0, 0, NULL, 0, 2, 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(ENOENT, ==,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
spa_create("ztest_bad_mirror", nvroot, NULL, NULL, NULL));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Attempt to create an existing pool. It shouldn't matter
|
|
|
|
* what's in the nvroot; we should fail with EEXIST.
|
|
|
|
*/
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2018-09-06 04:33:36 +03:00
|
|
|
nvroot = make_vdev_root("/dev/bogus", NULL, NULL, 0, 0, NULL, 0, 0, 1);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
VERIFY3U(EEXIST, ==,
|
|
|
|
spa_create(zo->zo_pool, nvroot, NULL, NULL, NULL));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2019-07-18 23:02:33 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We open a reference to the spa and then we try to export it
|
|
|
|
* expecting one of the following errors:
|
|
|
|
*
|
|
|
|
* EBUSY
|
|
|
|
* Because of the reference we just opened.
|
|
|
|
*
|
|
|
|
* ZFS_ERR_EXPORT_IN_PROGRESS
|
|
|
|
* For the case that there is another ztest thread doing
|
|
|
|
* an export concurrently.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(zo->zo_pool, &spa, FTAG));
|
2019-07-18 23:02:33 +03:00
|
|
|
int error = spa_destroy(zo->zo_pool);
|
|
|
|
if (error != EBUSY && error != ZFS_ERR_EXPORT_IN_PROGRESS) {
|
|
|
|
fatal(0, "spa_destroy(%s) returned unexpected value %d",
|
|
|
|
spa->spa_name, error);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2017-07-21 03:54:26 +03:00
|
|
|
/*
|
|
|
|
* Start and then stop the MMP threads to ensure the startup and shutdown code
|
|
|
|
* works properly. Actual protection and property-related code tested via ZTS.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_mmp_enable_disable(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
ztest_shared_opts_t *zo = &ztest_opts;
|
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
|
|
|
|
if (zo->zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2019-03-01 04:56:19 +03:00
|
|
|
/*
|
|
|
|
* Since enabling MMP involves setting a property, it could not be done
|
|
|
|
* while the pool is suspended.
|
|
|
|
*/
|
|
|
|
if (spa_suspended(spa))
|
|
|
|
return;
|
|
|
|
|
2017-07-21 03:54:26 +03:00
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
mutex_enter(&spa->spa_props_lock);
|
|
|
|
|
2018-08-13 18:12:25 +03:00
|
|
|
zfs_multihost_fail_intervals = 0;
|
|
|
|
|
2017-07-21 03:54:26 +03:00
|
|
|
if (!spa_multihost(spa)) {
|
|
|
|
spa->spa_multihost = B_TRUE;
|
|
|
|
mmp_thread_start(spa);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&spa->spa_props_lock);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
mmp_signal_all_threads();
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
mutex_enter(&spa->spa_props_lock);
|
|
|
|
|
|
|
|
if (spa_multihost(spa)) {
|
|
|
|
mmp_thread_stop(spa);
|
|
|
|
spa->spa_multihost = B_FALSE;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&spa->spa_props_lock);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
}
|
|
|
|
|
2012-12-15 04:28:49 +04:00
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_spa_upgrade(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
uint64_t initial_version = SPA_VERSION_INITIAL;
|
|
|
|
uint64_t version, newversion;
|
|
|
|
nvlist_t *nvroot, *props;
|
|
|
|
char *name;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
/* dRAID added after feature flags, skip upgrade test. */
|
|
|
|
if (strcmp(ztest_opts.zo_raid_type, VDEV_TYPE_DRAID) == 0)
|
|
|
|
return;
|
|
|
|
|
2012-12-15 04:28:49 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
name = kmem_asprintf("%s_upgrade", ztest_opts.zo_pool);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clean up from previous runs.
|
|
|
|
*/
|
|
|
|
(void) spa_destroy(name);
|
|
|
|
|
|
|
|
nvroot = make_vdev_root(NULL, NULL, name, ztest_opts.zo_vdev_size, 0,
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
NULL, ztest_opts.zo_raid_children, ztest_opts.zo_mirrors, 1);
|
2012-12-15 04:28:49 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're configuring a RAIDZ device then make sure that the
|
2019-01-13 21:11:52 +03:00
|
|
|
* initial version is capable of supporting that feature.
|
2012-12-15 04:28:49 +04:00
|
|
|
*/
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
switch (ztest_opts.zo_raid_parity) {
|
2012-12-15 04:28:49 +04:00
|
|
|
case 0:
|
|
|
|
case 1:
|
|
|
|
initial_version = SPA_VERSION_INITIAL;
|
|
|
|
break;
|
|
|
|
case 2:
|
|
|
|
initial_version = SPA_VERSION_RAIDZ2;
|
|
|
|
break;
|
|
|
|
case 3:
|
|
|
|
initial_version = SPA_VERSION_RAIDZ3;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create a pool with a spa version that can be upgraded. Pick
|
|
|
|
* a value between initial_version and SPA_VERSION_BEFORE_FEATURES.
|
|
|
|
*/
|
|
|
|
do {
|
|
|
|
version = ztest_random_spa_version(initial_version);
|
|
|
|
} while (version > SPA_VERSION_BEFORE_FEATURES);
|
|
|
|
|
|
|
|
props = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_VERSION), version);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_create(name, nvroot, props, NULL, NULL));
|
2012-12-15 04:28:49 +04:00
|
|
|
fnvlist_free(nvroot);
|
|
|
|
fnvlist_free(props);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(name, &spa, FTAG));
|
2012-12-15 04:28:49 +04:00
|
|
|
VERIFY3U(spa_version(spa), ==, version);
|
|
|
|
newversion = ztest_random_spa_version(version + 1);
|
|
|
|
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("upgrading spa version from %llu to %llu\n",
|
|
|
|
(u_longlong_t)version, (u_longlong_t)newversion);
|
|
|
|
}
|
|
|
|
|
|
|
|
spa_upgrade(spa, newversion);
|
|
|
|
VERIFY3U(spa_version(spa), >, version);
|
|
|
|
VERIFY3U(spa_version(spa), ==, fnvlist_lookup_uint64(spa->spa_config,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_VERSION)));
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
2019-10-10 19:47:06 +03:00
|
|
|
kmem_strfree(name);
|
2012-12-15 04:28:49 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
}
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
static void
|
|
|
|
ztest_spa_checkpoint(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&ztest_checkpoint_lock));
|
|
|
|
|
|
|
|
int error = spa_checkpoint(spa->spa_name);
|
|
|
|
|
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
case ZFS_ERR_DEVRM_IN_PROGRESS:
|
|
|
|
case ZFS_ERR_DISCARDING_CHECKPOINT:
|
|
|
|
case ZFS_ERR_CHECKPOINT_EXISTS:
|
|
|
|
break;
|
|
|
|
case ENOSPC:
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
fatal(0, "spa_checkpoint(%s) = %d", spa->spa_name, error);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_spa_discard_checkpoint(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&ztest_checkpoint_lock));
|
|
|
|
|
|
|
|
int error = spa_checkpoint_discard(spa->spa_name);
|
|
|
|
|
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
case ZFS_ERR_DISCARDING_CHECKPOINT:
|
|
|
|
case ZFS_ERR_NO_CHECKPOINT:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
fatal(0, "spa_discard_checkpoint(%s) = %d",
|
|
|
|
spa->spa_name, error);
|
|
|
|
}
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_spa_checkpoint_create_discard(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
|
|
|
|
mutex_enter(&ztest_checkpoint_lock);
|
|
|
|
if (ztest_random(2) == 0) {
|
|
|
|
ztest_spa_checkpoint(spa);
|
|
|
|
} else {
|
|
|
|
ztest_spa_discard_checkpoint(spa);
|
|
|
|
}
|
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static vdev_t *
|
|
|
|
vdev_lookup_by_path(vdev_t *vd, const char *path)
|
|
|
|
{
|
|
|
|
vdev_t *mvd;
|
2010-08-26 20:52:39 +04:00
|
|
|
int c;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (vd->vdev_path != NULL && strcmp(path, vd->vdev_path) == 0)
|
|
|
|
return (vd);
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (c = 0; c < vd->vdev_children; c++)
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((mvd = vdev_lookup_by_path(vd->vdev_child[c], path)) !=
|
|
|
|
NULL)
|
|
|
|
return (mvd);
|
|
|
|
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
static int
|
|
|
|
spa_num_top_vdevs(spa_t *spa)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
ASSERT3U(spa_config_held(spa, SCL_VDEV, RW_READER), ==, SCL_VDEV);
|
|
|
|
return (rvd->vdev_children);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that vdev_add() works as expected.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_vdev_add_remove(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t leaves;
|
|
|
|
uint64_t guid;
|
|
|
|
nvlist_t *nvroot;
|
|
|
|
int error;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
leaves = MAX(zs->zs_mirrors + zs->zs_splits, 1) *
|
|
|
|
ztest_opts.zo_raid_children;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
ztest_shared->zs_vdev_next_leaf = spa_num_top_vdevs(spa) * leaves;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have slogs then remove them 1/4 of the time.
|
|
|
|
*/
|
|
|
|
if (spa_has_slogs(spa) && ztest_random(4) == 0) {
|
2018-09-06 04:33:36 +03:00
|
|
|
metaslab_group_t *mg;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2018-09-06 04:33:36 +03:00
|
|
|
* find the first real slog in log allocation class
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2020-12-15 21:55:44 +03:00
|
|
|
mg = spa_log_class(spa)->mc_allocator[0].mca_rotor;
|
2018-09-06 04:33:36 +03:00
|
|
|
while (!mg->mg_vd->vdev_islog)
|
|
|
|
mg = mg->mg_next;
|
|
|
|
|
|
|
|
guid = mg->mg_vd->vdev_guid;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We have to grab the zs_name_lock as writer to
|
|
|
|
* prevent a race between removing a slog (dmu_objset_find)
|
|
|
|
* and destroying a dataset. Removing the slog will
|
|
|
|
* grab a reference on the dataset which may cause
|
2013-09-04 16:00:57 +04:00
|
|
|
* dsl_destroy_head() to fail with EBUSY thus
|
2010-05-29 00:45:14 +04:00
|
|
|
* leaving the dataset in an inconsistent state.
|
|
|
|
*/
|
2018-06-05 02:52:10 +03:00
|
|
|
pthread_rwlock_wrlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_vdev_remove(spa, guid, B_FALSE);
|
2018-06-05 02:52:10 +03:00
|
|
|
pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-06-14 19:41:27 +03:00
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
case EEXIST: /* Generic zil_reset() error */
|
|
|
|
case EBUSY: /* Replay required */
|
|
|
|
case EACCES: /* Crypto key not loaded */
|
2016-12-17 01:11:29 +03:00
|
|
|
case ZFS_ERR_CHECKPOINT_EXISTS:
|
|
|
|
case ZFS_ERR_DISCARDING_CHECKPOINT:
|
2018-06-14 19:41:27 +03:00
|
|
|
break;
|
|
|
|
default:
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal(0, "spa_vdev_remove() = %d", error);
|
2018-06-14 19:41:27 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
/*
|
2018-09-06 04:33:36 +03:00
|
|
|
* Make 1/4 of the devices be log devices
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2012-12-15 04:28:49 +04:00
|
|
|
nvroot = make_vdev_root(NULL, NULL, NULL,
|
2018-09-06 04:33:36 +03:00
|
|
|
ztest_opts.zo_vdev_size, 0, (ztest_random(4) == 0) ?
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
"log" : NULL, ztest_opts.zo_raid_children, zs->zs_mirrors,
|
|
|
|
1);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
error = spa_vdev_add(spa, nvroot);
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
break;
|
|
|
|
case ENOSPC:
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc("spa_vdev_add");
|
2016-12-17 01:11:29 +03:00
|
|
|
break;
|
|
|
|
default:
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal(0, "spa_vdev_add() = %d", error);
|
2016-12-17 01:11:29 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_vdev_class_add(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
ztest_shared_t *zs = ztest_shared;
|
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
uint64_t leaves;
|
|
|
|
nvlist_t *nvroot;
|
|
|
|
const char *class = (ztest_random(2) == 0) ?
|
|
|
|
VDEV_ALLOC_BIAS_SPECIAL : VDEV_ALLOC_BIAS_DEDUP;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* By default add a special vdev 50% of the time
|
|
|
|
*/
|
|
|
|
if ((ztest_opts.zo_special_vdevs == ZTEST_VDEV_CLASS_OFF) ||
|
|
|
|
(ztest_opts.zo_special_vdevs == ZTEST_VDEV_CLASS_RND &&
|
|
|
|
ztest_random(2) == 0)) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
/* Only test with mirrors */
|
|
|
|
if (zs->zs_mirrors < 2) {
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* requires feature@allocation_classes */
|
|
|
|
if (!spa_feature_is_enabled(spa, SPA_FEATURE_ALLOCATION_CLASSES)) {
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
leaves = MAX(zs->zs_mirrors + zs->zs_splits, 1) *
|
|
|
|
ztest_opts.zo_raid_children;
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
ztest_shared->zs_vdev_next_leaf = spa_num_top_vdevs(spa) * leaves;
|
2018-09-06 04:33:36 +03:00
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
nvroot = make_vdev_root(NULL, NULL, NULL, ztest_opts.zo_vdev_size, 0,
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
class, ztest_opts.zo_raid_children, zs->zs_mirrors, 1);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
error = spa_vdev_add(spa, nvroot);
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
if (error == ENOSPC)
|
|
|
|
ztest_record_enospc("spa_vdev_add");
|
|
|
|
else if (error != 0)
|
|
|
|
fatal(0, "spa_vdev_add() = %d", error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 50% of the time allow small blocks in the special class
|
|
|
|
*/
|
|
|
|
if (error == 0 &&
|
|
|
|
spa_special_class(spa)->mc_groups == 1 && ztest_random(2) == 0) {
|
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
|
|
|
(void) printf("Enabling special VDEV small blocks\n");
|
|
|
|
(void) ztest_dsl_prop_set_uint64(zd->zd_name,
|
|
|
|
ZFS_PROP_SPECIAL_SMALL_BLOCKS, 32768, B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
if (ztest_opts.zo_verbose >= 3) {
|
|
|
|
metaslab_class_t *mc;
|
|
|
|
|
|
|
|
if (strcmp(class, VDEV_ALLOC_BIAS_SPECIAL) == 0)
|
|
|
|
mc = spa_special_class(spa);
|
|
|
|
else
|
|
|
|
mc = spa_dedup_class(spa);
|
|
|
|
(void) printf("Added a %s mirrored vdev (of %d)\n",
|
|
|
|
class, (int)mc->mc_groups);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Verify that adding/removing aux devices (l2arc, hot spare) works as expected.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_vdev_aux_add_remove(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
spa_aux_vdev_t *sav;
|
|
|
|
char *aux;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *path;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t guid = 0;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
int error, ignore_err = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
path = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (ztest_random(2) == 0) {
|
|
|
|
sav = &spa->spa_spares;
|
|
|
|
aux = ZPOOL_CONFIG_SPARES;
|
|
|
|
} else {
|
|
|
|
sav = &spa->spa_l2cache;
|
|
|
|
aux = ZPOOL_CONFIG_L2CACHE;
|
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
|
|
|
if (sav->sav_count != 0 && ztest_random(4) == 0) {
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Pick a random device to remove.
|
|
|
|
*/
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
vdev_t *svd = sav->sav_vdevs[ztest_random(sav->sav_count)];
|
|
|
|
|
|
|
|
/* dRAID spares cannot be removed; try anyways to see ENOTSUP */
|
|
|
|
if (strstr(svd->vdev_path, VDEV_TYPE_DRAID) != NULL)
|
|
|
|
ignore_err = ENOTSUP;
|
|
|
|
|
|
|
|
guid = svd->vdev_guid;
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Find an unused device we can add.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_vdev_aux = 0;
|
2008-12-03 23:09:06 +03:00
|
|
|
for (;;) {
|
|
|
|
int c;
|
2013-07-02 08:07:15 +04:00
|
|
|
(void) snprintf(path, MAXPATHLEN, ztest_aux_template,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_dir, ztest_opts.zo_pool, aux,
|
|
|
|
zs->zs_vdev_aux);
|
2008-12-03 23:09:06 +03:00
|
|
|
for (c = 0; c < sav->sav_count; c++)
|
|
|
|
if (strcmp(sav->sav_vdevs[c]->vdev_path,
|
|
|
|
path) == 0)
|
|
|
|
break;
|
|
|
|
if (c == sav->sav_count &&
|
|
|
|
vdev_lookup_by_path(rvd, path) == NULL)
|
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_vdev_aux++;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (guid == 0) {
|
|
|
|
/*
|
|
|
|
* Add a new device.
|
|
|
|
*/
|
2012-12-15 04:28:49 +04:00
|
|
|
nvlist_t *nvroot = make_vdev_root(NULL, aux, NULL,
|
2018-09-06 04:33:36 +03:00
|
|
|
(ztest_opts.zo_vdev_size * 5) / 4, 0, NULL, 0, 0, 1);
|
2008-12-03 23:09:06 +03:00
|
|
|
error = spa_vdev_add(spa, nvroot);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
break;
|
|
|
|
default:
|
2008-12-03 23:09:06 +03:00
|
|
|
fatal(0, "spa_vdev_add(%p) = %d", nvroot, error);
|
2016-12-17 01:11:29 +03:00
|
|
|
}
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Remove an existing device. Sometimes, dirty its
|
|
|
|
* vdev state first to make sure we handle removal
|
|
|
|
* of devices that have pending state changes.
|
|
|
|
*/
|
|
|
|
if (ztest_random(2) == 0)
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) vdev_online(spa, guid, 0, NULL);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
error = spa_vdev_remove(spa, guid, B_FALSE);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
case EBUSY:
|
|
|
|
case ZFS_ERR_CHECKPOINT_EXISTS:
|
|
|
|
case ZFS_ERR_DISCARDING_CHECKPOINT:
|
|
|
|
break;
|
|
|
|
default:
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (error != ignore_err)
|
|
|
|
fatal(0, "spa_vdev_remove(%llu) = %d", guid,
|
|
|
|
error);
|
2016-12-17 01:11:29 +03:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
|
|
|
umem_free(path, MAXPATHLEN);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* split a pool if it has mirror tlvdevs
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_split_pool(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
nvlist_t *tree, **child, *config, *split, **schild;
|
|
|
|
uint_t c, children, schildren = 0, lastlogid = 0;
|
|
|
|
int error = 0;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2019-01-13 21:11:52 +03:00
|
|
|
/* ensure we have a usable config; mirrors of raidz aren't supported */
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (zs->zs_mirrors < 3 || ztest_opts.zo_raid_children > 1) {
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* clean up the old pool, if any */
|
|
|
|
(void) spa_destroy("splitp");
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
|
|
|
/* generate a config from the existing config */
|
|
|
|
mutex_enter(&spa->spa_props_lock);
|
2021-01-11 20:30:33 +03:00
|
|
|
tree = fnvlist_lookup_nvlist(spa->spa_config, ZPOOL_CONFIG_VDEV_TREE);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&spa->spa_props_lock);
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(tree, ZPOOL_CONFIG_CHILDREN,
|
|
|
|
&child, &children));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
schild = malloc(rvd->vdev_children * sizeof (nvlist_t *));
|
|
|
|
for (c = 0; c < children; c++) {
|
|
|
|
vdev_t *tvd = rvd->vdev_child[c];
|
|
|
|
nvlist_t **mchild;
|
|
|
|
uint_t mchildren;
|
|
|
|
|
|
|
|
if (tvd->vdev_islog || tvd->vdev_ops == &vdev_hole_ops) {
|
2021-01-11 20:30:33 +03:00
|
|
|
schild[schildren] = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(schild[schildren],
|
|
|
|
ZPOOL_CONFIG_TYPE, VDEV_TYPE_HOLE);
|
|
|
|
fnvlist_add_uint64(schild[schildren],
|
|
|
|
ZPOOL_CONFIG_IS_HOLE, 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (lastlogid == 0)
|
|
|
|
lastlogid = schildren;
|
|
|
|
++schildren;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
lastlogid = 0;
|
2021-01-11 20:30:33 +03:00
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(child[c],
|
|
|
|
ZPOOL_CONFIG_CHILDREN, &mchild, &mchildren));
|
|
|
|
schild[schildren++] = fnvlist_dup(mchild[0]);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* OK, create a config that can be used to split */
|
2021-01-11 20:30:33 +03:00
|
|
|
split = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(split, ZPOOL_CONFIG_TYPE, VDEV_TYPE_ROOT);
|
|
|
|
fnvlist_add_nvlist_array(split, ZPOOL_CONFIG_CHILDREN, schild,
|
|
|
|
lastlogid != 0 ? lastlogid : schildren);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
config = fnvlist_alloc();
|
|
|
|
fnvlist_add_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, split);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
for (c = 0; c < schildren; c++)
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(schild[c]);
|
2010-05-29 00:45:14 +04:00
|
|
|
free(schild);
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(split);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_wrlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_vdev_split_mirror(spa, "splitp", config, NULL, B_FALSE);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(config);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (error == 0) {
|
|
|
|
(void) printf("successful split - results:\n");
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
show_pool_stats(spa);
|
|
|
|
show_pool_stats(spa_lookup("splitp"));
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
++zs->zs_splits;
|
|
|
|
--zs->zs_mirrors;
|
|
|
|
}
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can attach and detach devices.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_vdev_attach_detach(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_aux_vdev_t *sav = &spa->spa_spares;
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
vdev_t *oldvd, *newvd, *pvd;
|
2008-12-03 23:09:06 +03:00
|
|
|
nvlist_t *root;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t leaves;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t leaf, top;
|
|
|
|
uint64_t ashift = ztest_get_ashift();
|
2009-01-16 00:59:39 +03:00
|
|
|
uint64_t oldguid, pguid;
|
2013-08-08 00:16:22 +04:00
|
|
|
uint64_t oldsize, newsize;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *oldpath, *newpath;
|
2008-11-20 23:01:55 +03:00
|
|
|
int replacing;
|
2008-12-03 23:09:06 +03:00
|
|
|
int oldvd_has_siblings = B_FALSE;
|
|
|
|
int newvd_is_spare = B_FALSE;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
int newvd_is_dspare = B_FALSE;
|
2008-12-03 23:09:06 +03:00
|
|
|
int oldvd_is_log;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error, expected_error;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
oldpath = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
newpath = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
leaves = MAX(zs->zs_mirrors, 1) * ztest_opts.zo_raid_children;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If a vdev is in the process of being removed, its removal may
|
|
|
|
* finish while we are in progress, leading to an unexpected error
|
|
|
|
* value. Don't bother trying to attach while we are in the middle
|
|
|
|
* of removal.
|
|
|
|
*/
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
if (ztest_device_removal_active) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2020-12-25 07:51:13 +03:00
|
|
|
goto out;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Decide whether to do an attach or a replace.
|
|
|
|
*/
|
|
|
|
replacing = ztest_random(2);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick a random top-level vdev.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
top = ztest_random_vdev_top(spa, B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick a random leaf within it.
|
|
|
|
*/
|
|
|
|
leaf = ztest_random(leaves);
|
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Locate this vdev.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
oldvd = rvd->vdev_child[top];
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
/* pick a child from the mirror */
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zs->zs_mirrors >= 1) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(oldvd->vdev_ops, ==, &vdev_mirror_ops);
|
|
|
|
ASSERT3U(oldvd->vdev_children, >=, zs->zs_mirrors);
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
oldvd = oldvd->vdev_child[leaf / ztest_opts.zo_raid_children];
|
2009-01-16 00:59:39 +03:00
|
|
|
}
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
/* pick a child out of the raidz group */
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (ztest_opts.zo_raid_children > 1) {
|
|
|
|
if (strcmp(oldvd->vdev_ops->vdev_op_type, "raidz") == 0)
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(oldvd->vdev_ops, ==, &vdev_raidz_ops);
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
else
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(oldvd->vdev_ops, ==, &vdev_draid_ops);
|
|
|
|
ASSERT3U(oldvd->vdev_children, ==, ztest_opts.zo_raid_children);
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
oldvd = oldvd->vdev_child[leaf % ztest_opts.zo_raid_children];
|
2009-01-16 00:59:39 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* If we're already doing an attach or replace, oldvd may be a
|
|
|
|
* mirror vdev -- in which case, pick a random child.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
while (oldvd->vdev_children != 0) {
|
|
|
|
oldvd_has_siblings = B_TRUE;
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(oldvd->vdev_children, >=, 2);
|
2009-01-16 00:59:39 +03:00
|
|
|
oldvd = oldvd->vdev_child[ztest_random(oldvd->vdev_children)];
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
oldguid = oldvd->vdev_guid;
|
2009-07-03 02:44:48 +04:00
|
|
|
oldsize = vdev_get_min_asize(oldvd);
|
2008-12-03 23:09:06 +03:00
|
|
|
oldvd_is_log = oldvd->vdev_top->vdev_islog;
|
|
|
|
(void) strcpy(oldpath, oldvd->vdev_path);
|
|
|
|
pvd = oldvd->vdev_parent;
|
2009-01-16 00:59:39 +03:00
|
|
|
pguid = pvd->vdev_guid;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2019-01-18 20:47:55 +03:00
|
|
|
* If oldvd has siblings, then half of the time, detach it. Prior
|
|
|
|
* to the detach the pool is scrubbed in order to prevent creating
|
|
|
|
* unrepairable blocks as a result of the data corruption injection.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (oldvd_has_siblings && ztest_random(2) == 0) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2019-01-18 20:47:55 +03:00
|
|
|
|
|
|
|
error = ztest_scrub_impl(spa);
|
|
|
|
if (error)
|
|
|
|
goto out;
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
error = spa_vdev_detach(spa, oldguid, pguid, B_FALSE);
|
|
|
|
if (error != 0 && error != ENODEV && error != EBUSY &&
|
2016-12-17 01:11:29 +03:00
|
|
|
error != ENOTSUP && error != ZFS_ERR_CHECKPOINT_EXISTS &&
|
|
|
|
error != ZFS_ERR_DISCARDING_CHECKPOINT)
|
2009-01-16 00:59:39 +03:00
|
|
|
fatal(0, "detach (%s) returned %d", oldpath, error);
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* For the new vdev, choose with equal probability between the two
|
|
|
|
* standard paths (ending in either 'a' or 'b') or a random hot spare.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (sav->sav_count != 0 && ztest_random(3) == 0) {
|
|
|
|
newvd = sav->sav_vdevs[ztest_random(sav->sav_count)];
|
|
|
|
newvd_is_spare = B_TRUE;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
|
|
|
|
if (newvd->vdev_ops == &vdev_draid_spare_ops)
|
|
|
|
newvd_is_dspare = B_TRUE;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) strcpy(newpath, newvd->vdev_path);
|
|
|
|
} else {
|
Fix ztest vdev file paths.
Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:
snprintf(path, sizeof (path), ztest_dev_template, ...);
This worked fine until 40b84e7aec6392187722e61e5a4a853b530bf60f, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.
As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".
This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
2012-10-01 18:34:52 +04:00
|
|
|
(void) snprintf(newpath, MAXPATHLEN, ztest_dev_template,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_dir, ztest_opts.zo_pool,
|
|
|
|
top * leaves + leaf);
|
2008-12-03 23:09:06 +03:00
|
|
|
if (ztest_random(2) == 0)
|
|
|
|
newpath[strlen(newpath) - 1] = 'b';
|
|
|
|
newvd = vdev_lookup_by_path(rvd, newpath);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (newvd) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
|
|
|
* Reopen to ensure the vdev's asize field isn't stale.
|
|
|
|
*/
|
|
|
|
vdev_reopen(newvd);
|
2009-07-03 02:44:48 +04:00
|
|
|
newsize = vdev_get_min_asize(newvd);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Make newsize a little bigger or smaller than oldsize.
|
|
|
|
* If it's smaller, the attach should fail.
|
|
|
|
* If it's larger, and we're doing a replace,
|
|
|
|
* we should get dynamic LUN growth when we're done.
|
|
|
|
*/
|
|
|
|
newsize = 10 * oldsize / (9 + ztest_random(3));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If pvd is not a mirror or root, the attach should fail with ENOTSUP,
|
|
|
|
* unless it's a replace; in that case any non-replacing parent is OK.
|
|
|
|
*
|
|
|
|
* If newvd is already part of the pool, it should fail with EBUSY.
|
|
|
|
*
|
|
|
|
* If newvd is too small, it should fail with EOVERFLOW.
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
*
|
|
|
|
* If newvd is a distributed spare and it's being attached to a
|
|
|
|
* dRAID which is not its parent it should fail with EINVAL.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (pvd->vdev_ops != &vdev_mirror_ops &&
|
|
|
|
pvd->vdev_ops != &vdev_root_ops && (!replacing ||
|
|
|
|
pvd->vdev_ops == &vdev_replacing_ops ||
|
|
|
|
pvd->vdev_ops == &vdev_spare_ops))
|
2008-11-20 23:01:55 +03:00
|
|
|
expected_error = ENOTSUP;
|
2008-12-03 23:09:06 +03:00
|
|
|
else if (newvd_is_spare && (!replacing || oldvd_is_log))
|
|
|
|
expected_error = ENOTSUP;
|
|
|
|
else if (newvd == oldvd)
|
|
|
|
expected_error = replacing ? 0 : EBUSY;
|
|
|
|
else if (vdev_lookup_by_path(rvd, newpath) != NULL)
|
|
|
|
expected_error = EBUSY;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
else if (!newvd_is_dspare && newsize < oldsize)
|
2008-11-20 23:01:55 +03:00
|
|
|
expected_error = EOVERFLOW;
|
|
|
|
else if (ashift > oldvd->vdev_top->vdev_ashift)
|
|
|
|
expected_error = EDOM;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
else if (newvd_is_dspare && pvd != vdev_draid_spare_get_parent(newvd))
|
|
|
|
expected_error = ENOTSUP;
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
|
|
|
expected_error = 0;
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Build the nvlist describing newpath.
|
|
|
|
*/
|
2012-12-15 04:28:49 +04:00
|
|
|
root = make_vdev_root(newpath, NULL, NULL, newvd == NULL ? newsize : 0,
|
2018-09-06 04:33:36 +03:00
|
|
|
ashift, NULL, 0, 0, 1);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-07-03 21:05:50 +03:00
|
|
|
/*
|
|
|
|
* When supported select either a healing or sequential resilver.
|
|
|
|
*/
|
|
|
|
boolean_t rebuilding = B_FALSE;
|
|
|
|
if (pvd->vdev_ops == &vdev_mirror_ops ||
|
|
|
|
pvd->vdev_ops == &vdev_root_ops) {
|
|
|
|
rebuilding = !!ztest_random(2);
|
|
|
|
}
|
|
|
|
|
|
|
|
error = spa_vdev_attach(spa, oldguid, root, replacing, rebuilding);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(root);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If our parent was the replacing vdev, but the replace completed,
|
|
|
|
* then instead of failing with ENOTSUP we may either succeed,
|
|
|
|
* fail with ENODEV, or fail with EOVERFLOW.
|
|
|
|
*/
|
|
|
|
if (expected_error == ENOTSUP &&
|
|
|
|
(error == 0 || error == ENODEV || error == EOVERFLOW))
|
|
|
|
expected_error = error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If someone grew the LUN, the replacement may be too small.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (error == EOVERFLOW || error == EBUSY)
|
2008-11-20 23:01:55 +03:00
|
|
|
expected_error = error;
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
if (error == ZFS_ERR_CHECKPOINT_EXISTS ||
|
2020-07-03 21:05:50 +03:00
|
|
|
error == ZFS_ERR_DISCARDING_CHECKPOINT ||
|
|
|
|
error == ZFS_ERR_RESILVER_IN_PROGRESS ||
|
|
|
|
error == ZFS_ERR_REBUILD_IN_PROGRESS)
|
2016-12-17 01:11:29 +03:00
|
|
|
expected_error = error;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (error != expected_error && expected_error != EBUSY) {
|
|
|
|
fatal(0, "attach (%s %llu, %s %llu, %d) "
|
|
|
|
"returned %d, expected %d",
|
2013-08-08 00:16:22 +04:00
|
|
|
oldpath, oldsize, newpath,
|
|
|
|
newsize, replacing, error, expected_error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-08-26 22:13:05 +04:00
|
|
|
out:
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
|
|
|
umem_free(oldpath, MAXPATHLEN);
|
|
|
|
umem_free(newpath, MAXPATHLEN);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_device_removal(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
vdev_t *vd;
|
|
|
|
uint64_t guid;
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
int error;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
if (ztest_device_removal_active) {
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove a random top-level vdev and wait for removal to finish.
|
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
vd = vdev_lookup_top(spa, ztest_random_vdev_top(spa, B_FALSE));
|
|
|
|
guid = vd->vdev_guid;
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
error = spa_vdev_remove(spa, guid, B_FALSE);
|
|
|
|
if (error == 0) {
|
|
|
|
ztest_device_removal_active = B_TRUE;
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
|
2018-10-22 16:54:01 +03:00
|
|
|
/*
|
|
|
|
* spa->spa_vdev_removal is created in a sync task that
|
|
|
|
* is initiated via dsl_sync_task_nowait(). Since the
|
|
|
|
* task may not run before spa_vdev_remove() returns, we
|
|
|
|
* must wait at least 1 txg to ensure that the removal
|
|
|
|
* struct has been created.
|
|
|
|
*/
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
2018-11-29 07:47:09 +03:00
|
|
|
while (spa->spa_removing_phys.sr_state == DSS_SCANNING)
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
} else {
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
/*
|
|
|
|
* The pool needs to be scrubbed after completing device removal.
|
|
|
|
* Failure to do so may result in checksum errors due to the
|
|
|
|
* strategy employed by ztest_fault_inject() when selecting which
|
|
|
|
* offset are redundant and can be damaged.
|
|
|
|
*/
|
|
|
|
error = spa_scan(spa, POOL_SCAN_SCRUB);
|
|
|
|
if (error == 0) {
|
|
|
|
while (dsl_scan_scrubbing(spa_get_dsl(spa)))
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
ztest_device_removal_active = B_FALSE;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Callback function which expands the physical size of the vdev.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static vdev_t *
|
2009-07-03 02:44:48 +04:00
|
|
|
grow_vdev(vdev_t *vd, void *arg)
|
|
|
|
{
|
2019-12-05 23:37:00 +03:00
|
|
|
spa_t *spa __maybe_unused = vd->vdev_spa;
|
2009-07-03 02:44:48 +04:00
|
|
|
size_t *newsize = arg;
|
|
|
|
size_t fsize;
|
|
|
|
int fd;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(spa_config_held(spa, SCL_STATE, RW_READER), ==, SCL_STATE);
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT(vd->vdev_ops->vdev_op_leaf);
|
|
|
|
|
|
|
|
if ((fd = open(vd->vdev_path, O_RDWR)) == -1)
|
|
|
|
return (vd);
|
|
|
|
|
|
|
|
fsize = lseek(fd, 0, SEEK_END);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(ftruncate(fd, *newsize));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6) {
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) printf("%s grew from %lu to %lu bytes\n",
|
|
|
|
vd->vdev_path, (ulong_t)fsize, (ulong_t)*newsize);
|
|
|
|
}
|
|
|
|
(void) close(fd);
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Callback function which expands a given vdev by calling vdev_online().
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
2020-06-15 21:30:37 +03:00
|
|
|
static vdev_t *
|
2009-07-03 02:44:48 +04:00
|
|
|
online_vdev(vdev_t *vd, void *arg)
|
|
|
|
{
|
|
|
|
spa_t *spa = vd->vdev_spa;
|
|
|
|
vdev_t *tvd = vd->vdev_top;
|
|
|
|
uint64_t guid = vd->vdev_guid;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t generation = spa->spa_config_generation + 1;
|
|
|
|
vdev_state_t newstate = VDEV_STATE_UNKNOWN;
|
|
|
|
int error;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(spa_config_held(spa, SCL_STATE, RW_READER), ==, SCL_STATE);
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT(vd->vdev_ops->vdev_op_leaf);
|
|
|
|
|
|
|
|
/* Calling vdev_online will initialize the new metaslabs */
|
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = vdev_online(spa, guid, ZFS_ONLINE_EXPAND, &newstate);
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_config_enter(spa, SCL_STATE, spa, RW_READER);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* If vdev_online returned an error or the underlying vdev_open
|
|
|
|
* failed then we abort the expand. The only way to know that
|
|
|
|
* vdev_open fails is by checking the returned newstate.
|
|
|
|
*/
|
|
|
|
if (error || newstate != VDEV_STATE_HEALTHY) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5) {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("Unable to expand vdev, state %llu, "
|
|
|
|
"error %d\n", (u_longlong_t)newstate, error);
|
|
|
|
}
|
|
|
|
return (vd);
|
|
|
|
}
|
|
|
|
ASSERT3U(newstate, ==, VDEV_STATE_HEALTHY);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Since we dropped the lock we need to ensure that we're
|
|
|
|
* still talking to the original vdev. It's possible this
|
|
|
|
* vdev may have been detached/replaced while we were
|
|
|
|
* trying to online it.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (generation != spa->spa_config_generation) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5) {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("vdev configuration has changed, "
|
|
|
|
"guid %llu, state %llu, expected gen %llu, "
|
|
|
|
"got gen %llu\n",
|
|
|
|
(u_longlong_t)guid,
|
|
|
|
(u_longlong_t)tvd->vdev_state,
|
|
|
|
(u_longlong_t)generation,
|
|
|
|
(u_longlong_t)spa->spa_config_generation);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
return (vd);
|
|
|
|
}
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Traverse the vdev tree calling the supplied function.
|
|
|
|
* We continue to walk the tree until we either have walked all
|
|
|
|
* children or we receive a non-NULL return from the callback.
|
|
|
|
* If a NULL callback is passed, then we just return back the first
|
|
|
|
* leaf vdev we encounter.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static vdev_t *
|
2009-07-03 02:44:48 +04:00
|
|
|
vdev_walk_tree(vdev_t *vd, vdev_t *(*func)(vdev_t *, void *), void *arg)
|
|
|
|
{
|
2010-08-26 20:52:39 +04:00
|
|
|
uint_t c;
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (vd->vdev_ops->vdev_op_leaf) {
|
|
|
|
if (func == NULL)
|
|
|
|
return (vd);
|
|
|
|
else
|
|
|
|
return (func(vd, arg));
|
|
|
|
}
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (c = 0; c < vd->vdev_children; c++) {
|
2009-07-03 02:44:48 +04:00
|
|
|
vdev_t *cvd = vd->vdev_child[c];
|
|
|
|
if ((cvd = vdev_walk_tree(cvd, func, arg)) != NULL)
|
|
|
|
return (cvd);
|
|
|
|
}
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Verify that dynamic LUN growth works as expected.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_vdev_LUN_growth(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *vd, *tvd;
|
|
|
|
metaslab_class_t *mc;
|
|
|
|
metaslab_group_t *mg;
|
2009-07-03 02:44:48 +04:00
|
|
|
size_t psize, newsize;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t top;
|
|
|
|
uint64_t old_class_space, new_class_space, old_ms_count, new_ms_count;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_enter(&ztest_checkpoint_lock);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_config_enter(spa, SCL_STATE, spa, RW_READER);
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
|
|
|
* If there is a vdev removal in progress, it could complete while
|
|
|
|
* we are running, in which case we would not be able to verify
|
|
|
|
* that the metaslab_class space increased (because it decreases
|
|
|
|
* when the device removal completes).
|
|
|
|
*/
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
if (ztest_device_removal_active) {
|
2016-12-17 01:11:29 +03:00
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
top = ztest_random_vdev_top(spa, B_TRUE);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
tvd = spa->spa_root_vdev->vdev_child[top];
|
|
|
|
mg = tvd->vdev_mg;
|
|
|
|
mc = mg->mg_class;
|
|
|
|
old_ms_count = tvd->vdev_ms_count;
|
|
|
|
old_class_space = metaslab_class_get_space(mc);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2009-07-03 02:44:48 +04:00
|
|
|
* Determine the size of the first leaf vdev associated with
|
|
|
|
* our top-level device.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-07-03 02:44:48 +04:00
|
|
|
vd = vdev_walk_tree(tvd, NULL, NULL);
|
|
|
|
ASSERT3P(vd, !=, NULL);
|
|
|
|
ASSERT(vd->vdev_ops->vdev_op_leaf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
psize = vd->vdev_psize;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* We only try to expand the vdev if it's healthy, less than 4x its
|
|
|
|
* original size, and it has a valid psize.
|
2009-07-03 02:44:48 +04:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (tvd->vdev_state != VDEV_STATE_HEALTHY ||
|
2012-01-24 06:43:32 +04:00
|
|
|
psize == 0 || psize >= 4 * ztest_opts.zo_vdev_size) {
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
2009-07-03 02:44:48 +04:00
|
|
|
return;
|
|
|
|
}
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(psize, >, 0);
|
2018-09-06 04:33:36 +03:00
|
|
|
newsize = psize + MAX(psize / 8, SPA_MAXBLOCKSIZE);
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT3U(newsize, >, psize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6) {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("Expanding LUN %s from %lu to %lu\n",
|
2009-07-03 02:44:48 +04:00
|
|
|
vd->vdev_path, (ulong_t)psize, (ulong_t)newsize);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Growing the vdev is a two step process:
|
|
|
|
* 1). expand the physical size (i.e. relabel)
|
|
|
|
* 2). online the vdev to create the new metaslabs
|
|
|
|
*/
|
|
|
|
if (vdev_walk_tree(tvd, grow_vdev, &newsize) != NULL ||
|
|
|
|
vdev_walk_tree(tvd, online_vdev, NULL) != NULL ||
|
|
|
|
tvd->vdev_state != VDEV_STATE_HEALTHY) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5) {
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) printf("Could not expand LUN because "
|
2010-05-29 00:45:14 +04:00
|
|
|
"the vdev configuration changed.\n");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
2009-07-03 02:44:48 +04:00
|
|
|
return;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Expanding the LUN will update the config asynchronously,
|
|
|
|
* thus we must wait for the async thread to complete any
|
|
|
|
* pending tasks before proceeding.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
for (;;) {
|
|
|
|
boolean_t done;
|
|
|
|
mutex_enter(&spa->spa_async_lock);
|
|
|
|
done = (spa->spa_async_thread == NULL && !spa->spa_async_tasks);
|
|
|
|
mutex_exit(&spa->spa_async_lock);
|
|
|
|
if (done)
|
|
|
|
break;
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
(void) poll(NULL, 0, 100);
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_STATE, spa, RW_READER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tvd = spa->spa_root_vdev->vdev_child[top];
|
|
|
|
new_ms_count = tvd->vdev_ms_count;
|
|
|
|
new_class_space = metaslab_class_get_space(mc);
|
|
|
|
|
|
|
|
if (tvd->vdev_mg != mg || mg->mg_class != mc) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5) {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("Could not verify LUN expansion due to "
|
|
|
|
"intervening vdev offline or remove.\n");
|
|
|
|
}
|
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure we were able to grow the vdev.
|
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (new_ms_count <= old_ms_count) {
|
|
|
|
fatal(0, "LUN expansion failed: ms_count %llu < %llu\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
old_ms_count, new_ms_count);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure we were able to grow the pool.
|
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (new_class_space <= old_class_space) {
|
|
|
|
fatal(0, "LUN expansion failed: class_space %llu < %llu\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
old_class_space, new_class_space);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5) {
|
2017-06-13 12:16:45 +03:00
|
|
|
char oldnumbuf[NN_NUMBUF_SZ], newnumbuf[NN_NUMBUF_SZ];
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
nicenum(old_class_space, oldnumbuf, sizeof (oldnumbuf));
|
|
|
|
nicenum(new_class_space, newnumbuf, sizeof (newnumbuf));
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) printf("%s grew from %s to %s\n",
|
|
|
|
spa->spa_name, oldnumbuf, newnumbuf);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Verify that dmu_objset_{create,destroy,open,close} work as expected.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_objset_create_cb(objset_t *os, void *arg, cred_t *cr, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Create the objects common to all ztest datasets.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_create_claim(os, ZTEST_DIROBJ,
|
|
|
|
DMU_OT_ZAP_OTHER, DMU_OT_NONE, 0, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
|
|
|
ztest_dataset_create(char *dsname)
|
|
|
|
{
|
2017-09-12 23:15:11 +03:00
|
|
|
int err;
|
|
|
|
uint64_t rand;
|
|
|
|
dsl_crypto_params_t *dcp = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 50% of the time, we create encrypted datasets
|
|
|
|
* using a random cipher suite and a hard-coded
|
|
|
|
* wrapping key.
|
|
|
|
*/
|
|
|
|
rand = ztest_random(2);
|
|
|
|
if (rand != 0) {
|
|
|
|
nvlist_t *crypto_args = fnvlist_alloc();
|
|
|
|
nvlist_t *props = fnvlist_alloc();
|
|
|
|
|
|
|
|
/* slight bias towards the default cipher suite */
|
|
|
|
rand = ztest_random(ZIO_CRYPT_FUNCTIONS);
|
|
|
|
if (rand < ZIO_CRYPT_AES_128_CCM)
|
|
|
|
rand = ZIO_CRYPT_ON;
|
|
|
|
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_ENCRYPTION), rand);
|
|
|
|
fnvlist_add_uint8_array(crypto_args, "wkeydata",
|
|
|
|
(uint8_t *)ztest_wkeydata, WRAPPING_KEY_LEN);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* These parameters aren't really used by the kernel. They
|
|
|
|
* are simply stored so that userspace knows how to load
|
|
|
|
* the wrapping key.
|
|
|
|
*/
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_KEYFORMAT), ZFS_KEYFORMAT_RAW);
|
|
|
|
fnvlist_add_string(props,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_KEYLOCATION), "prompt");
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT), 0ULL);
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS), 0ULL);
|
|
|
|
|
|
|
|
VERIFY0(dsl_crypto_params_create_nvlist(DCP_CMD_NONE, props,
|
|
|
|
crypto_args, &dcp));
|
|
|
|
|
2018-08-02 21:59:24 +03:00
|
|
|
/*
|
|
|
|
* Cycle through all available encryption implementations
|
|
|
|
* to verify interoperability.
|
|
|
|
*/
|
|
|
|
VERIFY0(gcm_impl_set("cycle"));
|
|
|
|
VERIFY0(aes_impl_set("cycle"));
|
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
fnvlist_free(crypto_args);
|
|
|
|
fnvlist_free(props);
|
|
|
|
}
|
|
|
|
|
|
|
|
err = dmu_objset_create(dsname, DMU_OST_OTHER, 0, dcp,
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_objset_create_cb, NULL);
|
2017-09-12 23:15:11 +03:00
|
|
|
dsl_crypto_params_free(dcp, !!err);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
rand = ztest_random(100);
|
|
|
|
if (err || rand < 80)
|
2010-05-29 00:45:14 +04:00
|
|
|
return (err);
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5)
|
2012-09-10 18:23:21 +04:00
|
|
|
(void) printf("Setting dataset %s to sync always\n", dsname);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (ztest_dsl_prop_set_uint64(dsname, ZFS_PROP_SYNC,
|
|
|
|
ZFS_SYNC_ALWAYS, B_FALSE));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
2008-11-20 23:01:55 +03:00
|
|
|
static int
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_objset_destroy_cb(const char *name, void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
objset_t *os;
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_object_info_t doi;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that the dataset contains a directory object.
|
|
|
|
*/
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY0(ztest_dmu_objset_own(name, DMU_OST_OTHER, B_TRUE,
|
|
|
|
B_TRUE, FTAG, &os));
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_object_info(os, ZTEST_DIROBJ, &doi);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error != ENOENT) {
|
|
|
|
/* We could have crashed in the middle of destroying it */
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT3U(doi.doi_type, ==, DMU_OT_ZAP_OTHER);
|
|
|
|
ASSERT3S(doi.doi_physical_blocks_512, >=, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Destroy the dataset.
|
|
|
|
*/
|
2013-09-04 16:00:57 +04:00
|
|
|
if (strchr(name, '@') != NULL) {
|
2017-01-26 23:32:36 +03:00
|
|
|
VERIFY0(dsl_destroy_snapshot(name, B_TRUE));
|
2013-09-04 16:00:57 +04:00
|
|
|
} else {
|
2017-01-26 23:32:36 +03:00
|
|
|
error = dsl_destroy_head(name);
|
2018-12-14 21:03:05 +03:00
|
|
|
if (error == ENOSPC) {
|
|
|
|
/* There could be checkpoint or insufficient slop */
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
} else if (error != EBUSY) {
|
|
|
|
/* There could be a hold on this dataset */
|
2017-01-26 23:32:36 +03:00
|
|
|
ASSERT0(error);
|
2018-12-14 21:03:05 +03:00
|
|
|
}
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static boolean_t
|
|
|
|
ztest_snapshot_create(char *osname, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-16 00:28:36 +03:00
|
|
|
char snapname[ZFS_MAX_DATASET_NAME_LEN];
|
2010-05-29 00:45:14 +04:00
|
|
|
int error;
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
(void) snprintf(snapname, sizeof (snapname), "%llu", (u_longlong_t)id);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dmu_objset_snapshot_one(osname, snapname);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
2013-09-04 16:00:57 +04:00
|
|
|
if (error != 0 && error != EEXIST) {
|
|
|
|
fatal(0, "ztest_snapshot_create(%s@%s) = %d", osname,
|
|
|
|
snapname, error);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static boolean_t
|
|
|
|
ztest_snapshot_destroy(char *osname, uint64_t id)
|
|
|
|
{
|
2016-06-16 00:28:36 +03:00
|
|
|
char snapname[ZFS_MAX_DATASET_NAME_LEN];
|
2010-05-29 00:45:14 +04:00
|
|
|
int error;
|
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
(void) snprintf(snapname, sizeof (snapname), "%s@%llu", osname,
|
2010-05-29 00:45:14 +04:00
|
|
|
(u_longlong_t)id);
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(snapname, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != 0 && error != ENOENT)
|
|
|
|
fatal(0, "ztest_snapshot_destroy(%s) = %d", snapname, error);
|
|
|
|
return (B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_objset_create_destroy(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_ds_t *zdtmp;
|
2010-05-29 00:45:14 +04:00
|
|
|
int iters;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
2008-12-03 23:09:06 +03:00
|
|
|
objset_t *os, *os2;
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
2008-11-20 23:01:55 +03:00
|
|
|
zilog_t *zilog;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
zdtmp = umem_alloc(sizeof (ztest_ds_t), UMEM_NOFAIL);
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
(void) snprintf(name, sizeof (name), "%s/temp_%llu",
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_pool, (u_longlong_t)id);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this dataset exists from a previous run, process its replay log
|
2013-09-04 16:00:57 +04:00
|
|
|
* half of the time. If we don't replay it, then dsl_destroy_head()
|
2010-05-29 00:45:14 +04:00
|
|
|
* (invoked from ztest_objset_destroy_cb()) should just throw it away.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
if (ztest_random(2) == 0 &&
|
2017-09-12 23:15:11 +03:00
|
|
|
ztest_dmu_objset_own(name, DMU_OST_OTHER, B_FALSE,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
B_TRUE, FTAG, &os) == 0) {
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_zd_init(zdtmp, NULL, os);
|
2010-08-26 22:13:05 +04:00
|
|
|
zil_replay(os, zdtmp, ztest_replay_vector);
|
|
|
|
ztest_zd_fini(zdtmp);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There may be an old instance of the dataset we're about to
|
|
|
|
* create lying around from a previous run. If so, destroy it
|
|
|
|
* and all of its snapshots.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) dmu_objset_find(name, ztest_objset_destroy_cb, NULL,
|
2008-11-20 23:01:55 +03:00
|
|
|
DS_FIND_CHILDREN | DS_FIND_SNAPSHOTS);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that the destroyed dataset is no longer in the namespace.
|
|
|
|
*/
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY3U(ENOENT, ==, ztest_dmu_objset_own(name, DMU_OST_OTHER, B_TRUE,
|
|
|
|
B_TRUE, FTAG, &os));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can create a new dataset.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
error = ztest_dataset_create(name);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error) {
|
|
|
|
if (error == ENOSPC) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc(FTAG);
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
fatal(0, "dmu_objset_create(%s) = %d", name, error);
|
|
|
|
}
|
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY0(ztest_dmu_objset_own(name, DMU_OST_OTHER, B_FALSE, B_TRUE,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
FTAG, &os));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_zd_init(zdtmp, NULL, os);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Open the intent log for it.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
zilog = zil_open(os, ztest_get_data);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Put some objects in there, do a little I/O to them,
|
|
|
|
* and randomly take a couple of snapshots along the way.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
iters = ztest_random(5);
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < iters; i++) {
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_dmu_object_alloc_free(zdtmp, id);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (ztest_random(iters) == 0)
|
|
|
|
(void) ztest_snapshot_create(name, i);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we cannot create an existing dataset.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(EEXIST, ==,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_create(name, DMU_OST_OTHER, 0, NULL, NULL, NULL));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Verify that we can hold an objset that is also owned.
|
2008-12-03 23:09:06 +03:00
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_objset_hold(name, FTAG, &os2));
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_rele(os2, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Verify that we cannot own an objset that is already owned.
|
|
|
|
*/
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY3U(EBUSY, ==, ztest_dmu_objset_own(name, DMU_OST_OTHER,
|
|
|
|
B_FALSE, B_TRUE, FTAG, &os2));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
zil_close(zilog);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_zd_fini(zdtmp);
|
|
|
|
out:
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
|
|
|
umem_free(zdtmp, sizeof (ztest_ds_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that dmu_snapshot_{create,destroy,open,close} work as expected.
|
|
|
|
*/
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_snapshot_create_destroy(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) ztest_snapshot_destroy(zd->zd_name, id);
|
|
|
|
(void) ztest_snapshot_create(zd->zd_name, id);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Cleanup non-standard snapshots and clones.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dsl_dataset_cleanup(char *osname, uint64_t id)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
char *snap1name;
|
|
|
|
char *clone1name;
|
|
|
|
char *snap2name;
|
|
|
|
char *clone2name;
|
|
|
|
char *snap3name;
|
2009-07-03 02:44:48 +04:00
|
|
|
int error;
|
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
snap1name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
clone1name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
snap2name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
clone2name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
snap3name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
(void) snprintf(snap1name, ZFS_MAX_DATASET_NAME_LEN,
|
|
|
|
"%s@s1_%llu", osname, (u_longlong_t)id);
|
|
|
|
(void) snprintf(clone1name, ZFS_MAX_DATASET_NAME_LEN,
|
|
|
|
"%s/c1_%llu", osname, (u_longlong_t)id);
|
|
|
|
(void) snprintf(snap2name, ZFS_MAX_DATASET_NAME_LEN,
|
|
|
|
"%s@s2_%llu", clone1name, (u_longlong_t)id);
|
|
|
|
(void) snprintf(clone2name, ZFS_MAX_DATASET_NAME_LEN,
|
|
|
|
"%s/c2_%llu", osname, (u_longlong_t)id);
|
|
|
|
(void) snprintf(snap3name, ZFS_MAX_DATASET_NAME_LEN,
|
|
|
|
"%s@s3_%llu", clone1name, (u_longlong_t)id);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_head(clone2name);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != ENOENT)
|
2013-09-04 16:00:57 +04:00
|
|
|
fatal(0, "dsl_destroy_head(%s) = %d", clone2name, error);
|
|
|
|
error = dsl_destroy_snapshot(snap3name, B_FALSE);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != ENOENT)
|
2013-09-04 16:00:57 +04:00
|
|
|
fatal(0, "dsl_destroy_snapshot(%s) = %d", snap3name, error);
|
|
|
|
error = dsl_destroy_snapshot(snap2name, B_FALSE);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != ENOENT)
|
2013-09-04 16:00:57 +04:00
|
|
|
fatal(0, "dsl_destroy_snapshot(%s) = %d", snap2name, error);
|
|
|
|
error = dsl_destroy_head(clone1name);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != ENOENT)
|
2013-09-04 16:00:57 +04:00
|
|
|
fatal(0, "dsl_destroy_head(%s) = %d", clone1name, error);
|
|
|
|
error = dsl_destroy_snapshot(snap1name, B_FALSE);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != ENOENT)
|
2013-09-04 16:00:57 +04:00
|
|
|
fatal(0, "dsl_destroy_snapshot(%s) = %d", snap1name, error);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
umem_free(snap1name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(clone1name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(snap2name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(clone2name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(snap3name, ZFS_MAX_DATASET_NAME_LEN);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify dsl_dataset_promote handles EBUSY
|
|
|
|
*/
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dsl_dataset_promote_busy(ztest_ds_t *zd, uint64_t id)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
objset_t *os;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *snap1name;
|
|
|
|
char *clone1name;
|
|
|
|
char *snap2name;
|
|
|
|
char *clone2name;
|
|
|
|
char *snap3name;
|
2010-05-29 00:45:14 +04:00
|
|
|
char *osname = zd->zd_name;
|
|
|
|
int error;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
snap1name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
clone1name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
snap2name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
clone2name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
snap3name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dsl_dataset_cleanup(osname, id);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
(void) snprintf(snap1name, ZFS_MAX_DATASET_NAME_LEN,
|
|
|
|
"%s@s1_%llu", osname, (u_longlong_t)id);
|
|
|
|
(void) snprintf(clone1name, ZFS_MAX_DATASET_NAME_LEN,
|
|
|
|
"%s/c1_%llu", osname, (u_longlong_t)id);
|
|
|
|
(void) snprintf(snap2name, ZFS_MAX_DATASET_NAME_LEN,
|
|
|
|
"%s@s2_%llu", clone1name, (u_longlong_t)id);
|
|
|
|
(void) snprintf(clone2name, ZFS_MAX_DATASET_NAME_LEN,
|
|
|
|
"%s/c2_%llu", osname, (u_longlong_t)id);
|
|
|
|
(void) snprintf(snap3name, ZFS_MAX_DATASET_NAME_LEN,
|
|
|
|
"%s@s3_%llu", clone1name, (u_longlong_t)id);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
error = dmu_objset_snapshot_one(osname, strchr(snap1name, '@') + 1);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != EEXIST) {
|
|
|
|
if (error == ENOSPC) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc(FTAG);
|
2009-07-03 02:44:48 +04:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
fatal(0, "dmu_take_snapshot(%s) = %d", snap1name, error);
|
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dmu_objset_clone(clone1name, snap1name);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error) {
|
|
|
|
if (error == ENOSPC) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc(FTAG);
|
2009-07-03 02:44:48 +04:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
fatal(0, "dmu_objset_create(%s) = %d", clone1name, error);
|
|
|
|
}
|
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
error = dmu_objset_snapshot_one(clone1name, strchr(snap2name, '@') + 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error && error != EEXIST) {
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal(0, "dmu_open_snapshot(%s) = %d", snap2name, error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
error = dmu_objset_snapshot_one(clone1name, strchr(snap3name, '@') + 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error && error != EEXIST) {
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
fatal(0, "dmu_open_snapshot(%s) = %d", snap3name, error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dmu_objset_clone(clone2name, snap3name);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
fatal(0, "dmu_objset_create(%s) = %d", clone2name, error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
error = ztest_dmu_objset_own(snap2name, DMU_OST_ANY, B_TRUE, B_TRUE,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
FTAG, &os);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error)
|
2013-09-04 16:00:57 +04:00
|
|
|
fatal(0, "dmu_objset_own(%s) = %d", snap2name, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dsl_dataset_promote(clone2name, NULL);
|
2014-06-06 01:19:08 +04:00
|
|
|
if (error == ENOSPC) {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2014-06-06 01:19:08 +04:00
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != EBUSY)
|
|
|
|
fatal(0, "dsl_dataset_promote(%s), %d, not EBUSY", clone2name,
|
|
|
|
error);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
out:
|
|
|
|
ztest_dsl_dataset_cleanup(osname, id);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
umem_free(snap1name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(clone1name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(snap2name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(clone2name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(snap3name, ZFS_MAX_DATASET_NAME_LEN);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
#undef OD_ARRAY_SIZE
|
2013-11-01 23:26:11 +04:00
|
|
|
#define OD_ARRAY_SIZE 4
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Verify that dmu_object_{alloc,free} work as expected.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_dmu_object_alloc_free(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
|
|
|
int batchsize;
|
|
|
|
int size;
|
2010-08-26 20:52:39 +04:00
|
|
|
int b;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
size = sizeof (ztest_od_t) * OD_ARRAY_SIZE;
|
2010-08-26 22:13:05 +04:00
|
|
|
od = umem_alloc(size, UMEM_NOFAIL);
|
|
|
|
batchsize = OD_ARRAY_SIZE;
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (b = 0; b < batchsize; b++)
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od + b, id, FTAG, b, DMU_OT_UINT64_OTHER,
|
|
|
|
0, 0, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Destroy the previous batch of objects, create a new batch,
|
|
|
|
* and do some I/O on the new objects.
|
|
|
|
*/
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, size, B_TRUE) != 0)
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
while (ztest_random(4 * batchsize) != 0)
|
|
|
|
ztest_io(zd, od[ztest_random(batchsize)].od_object,
|
|
|
|
ztest_random(ZTEST_RANGE_LOCKS) << SPA_MAXBLOCKSHIFT);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
|
|
|
umem_free(od, size);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2018-01-19 12:19:47 +03:00
|
|
|
/*
|
|
|
|
* Rewind the global allocator to verify object allocation backfilling.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_dmu_object_next_chunk(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
int dnodes_per_chunk = 1 << dmu_object_alloc_chunk_shift;
|
|
|
|
uint64_t object;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Rewind the global allocator randomly back to a lower object number
|
|
|
|
* to force backfilling and reclamation of recently freed dnodes.
|
|
|
|
*/
|
|
|
|
mutex_enter(&os->os_obj_lock);
|
|
|
|
object = ztest_random(os->os_obj_next_chunk);
|
|
|
|
os->os_obj_next_chunk = P2ALIGN(object, dnodes_per_chunk);
|
|
|
|
mutex_exit(&os->os_obj_lock);
|
|
|
|
}
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
#undef OD_ARRAY_SIZE
|
2013-11-01 23:26:11 +04:00
|
|
|
#define OD_ARRAY_SIZE 2
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Verify that dmu_{read,write} work as expected.
|
|
|
|
*/
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_read_write(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
int size;
|
|
|
|
ztest_od_t *od;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
2013-11-01 23:26:11 +04:00
|
|
|
size = sizeof (ztest_od_t) * OD_ARRAY_SIZE;
|
2010-08-26 22:13:05 +04:00
|
|
|
od = umem_alloc(size, UMEM_NOFAIL);
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_tx_t *tx;
|
|
|
|
int i, freeit, error;
|
|
|
|
uint64_t n, s, txg;
|
|
|
|
bufwad_t *packbuf, *bigbuf, *pack, *bigH, *bigT;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t packobj, packoff, packsize, bigobj, bigoff, bigsize;
|
|
|
|
uint64_t chunksize = (1000 + ztest_random(1000)) * sizeof (uint64_t);
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t regions = 997;
|
|
|
|
uint64_t stride = 123456789ULL;
|
|
|
|
uint64_t width = 40;
|
|
|
|
int free_percent = 5;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This test uses two objects, packobj and bigobj, that are always
|
|
|
|
* updated together (i.e. in the same tx) so that their contents are
|
|
|
|
* in sync and can be compared. Their contents relate to each other
|
|
|
|
* in a simple way: packobj is a dense array of 'bufwad' structures,
|
|
|
|
* while bigobj is a sparse array of the same bufwads. Specifically,
|
|
|
|
* for any index n, there are three bufwads that should be identical:
|
|
|
|
*
|
|
|
|
* packobj, at offset n * sizeof (bufwad_t)
|
|
|
|
* bigobj, at the head of the nth chunk
|
|
|
|
* bigobj, at the tail of the nth chunk
|
|
|
|
*
|
|
|
|
* The chunk size is arbitrary. It doesn't have to be a power of two,
|
|
|
|
* and it doesn't have any relation to the object blocksize.
|
|
|
|
* The only requirement is that it can hold at least two bufwads.
|
|
|
|
*
|
|
|
|
* Normally, we write the bufwad to each of these locations.
|
|
|
|
* However, free_percent of the time we instead write zeroes to
|
|
|
|
* packobj and perform a dmu_free_range() on bigobj. By comparing
|
|
|
|
* bigobj to packobj, we can verify that the DMU is correctly
|
|
|
|
* tracking which parts of an object are allocated and free,
|
|
|
|
* and that the contents of the allocated blocks are correct.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the directory info. If it's the first time, set things up.
|
|
|
|
*/
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_UINT64_OTHER, 0, 0, chunksize);
|
|
|
|
ztest_od_init(od + 1, id, FTAG, 1, DMU_OT_UINT64_OTHER, 0, 0,
|
|
|
|
chunksize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, size, B_FALSE) != 0) {
|
|
|
|
umem_free(od, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
bigobj = od[0].od_object;
|
|
|
|
packobj = od[1].od_object;
|
|
|
|
chunksize = od[0].od_gen;
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(chunksize, ==, od[1].od_gen);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Prefetch a random chunk of the big object.
|
|
|
|
* Our aim here is to get some async reads in flight
|
|
|
|
* for blocks that we may free below; the DMU should
|
|
|
|
* handle this race correctly.
|
|
|
|
*/
|
|
|
|
n = ztest_random(regions) * stride + ztest_random(width);
|
|
|
|
s = 1 + ztest_random(2 * width - 1);
|
2015-12-22 04:31:57 +03:00
|
|
|
dmu_prefetch(os, bigobj, 0, n * chunksize, s * chunksize,
|
|
|
|
ZIO_PRIORITY_SYNC_READ);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick a random index and compute the offsets into packobj and bigobj.
|
|
|
|
*/
|
|
|
|
n = ztest_random(regions) * stride + ztest_random(width);
|
|
|
|
s = 1 + ztest_random(width - 1);
|
|
|
|
|
|
|
|
packoff = n * sizeof (bufwad_t);
|
|
|
|
packsize = s * sizeof (bufwad_t);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
bigoff = n * chunksize;
|
|
|
|
bigsize = s * chunksize;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
packbuf = umem_alloc(packsize, UMEM_NOFAIL);
|
|
|
|
bigbuf = umem_alloc(bigsize, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* free_percent of the time, free a range of bigobj rather than
|
|
|
|
* overwriting it.
|
|
|
|
*/
|
|
|
|
freeit = (ztest_random(100) < free_percent);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the current contents of our objects.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_read(os, packobj, packoff, packsize, packbuf,
|
2009-07-03 02:44:48 +04:00
|
|
|
DMU_READ_PREFETCH);
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_read(os, bigobj, bigoff, bigsize, bigbuf,
|
2009-07-03 02:44:48 +04:00
|
|
|
DMU_READ_PREFETCH);
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Get a tx for the mods to both packobj and bigobj.
|
|
|
|
*/
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_write(tx, packobj, packoff, packsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (freeit)
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_free(tx, bigobj, bigoff, bigsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_write(tx, bigobj, bigoff, bigsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-08-07 22:32:46 +04:00
|
|
|
/* This accounts for setting the checksum/compression. */
|
|
|
|
dmu_tx_hold_bonus(tx, bigobj);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
umem_free(packbuf, packsize);
|
|
|
|
umem_free(bigbuf, bigsize);
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(od, size);
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
enum zio_checksum cksum;
|
|
|
|
do {
|
|
|
|
cksum = (enum zio_checksum)
|
|
|
|
ztest_random_dsl_prop(ZFS_PROP_CHECKSUM);
|
|
|
|
} while (cksum >= ZIO_CHECKSUM_LEGACY_FUNCTIONS);
|
|
|
|
dmu_object_set_checksum(os, bigobj, cksum, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
enum zio_compress comp;
|
|
|
|
do {
|
|
|
|
comp = (enum zio_compress)
|
|
|
|
ztest_random_dsl_prop(ZFS_PROP_COMPRESSION);
|
|
|
|
} while (comp >= ZIO_COMPRESS_LEGACY_FUNCTIONS);
|
|
|
|
dmu_object_set_compress(os, bigobj, comp, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For each index from n to n + s, verify that the existing bufwad
|
|
|
|
* in packobj matches the bufwads at the head and tail of the
|
|
|
|
* corresponding chunk in bigobj. Then update all three bufwads
|
|
|
|
* with the new values we want to write out.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < s; i++) {
|
|
|
|
/* LINTED */
|
|
|
|
pack = (bufwad_t *)((char *)packbuf + i * sizeof (bufwad_t));
|
|
|
|
/* LINTED */
|
2010-05-29 00:45:14 +04:00
|
|
|
bigH = (bufwad_t *)((char *)bigbuf + i * chunksize);
|
2008-11-20 23:01:55 +03:00
|
|
|
/* LINTED */
|
2010-05-29 00:45:14 +04:00
|
|
|
bigT = (bufwad_t *)((char *)bigH + chunksize) - 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U((uintptr_t)bigH - (uintptr_t)bigbuf, <, bigsize);
|
|
|
|
ASSERT3U((uintptr_t)bigT - (uintptr_t)bigbuf, <, bigsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (pack->bw_txg > txg)
|
|
|
|
fatal(0, "future leak: got %llx, open txg is %llx",
|
|
|
|
pack->bw_txg, txg);
|
|
|
|
|
|
|
|
if (pack->bw_data != 0 && pack->bw_index != n + i)
|
|
|
|
fatal(0, "wrong index: got %llx, wanted %llx+%llx",
|
|
|
|
pack->bw_index, n, i);
|
|
|
|
|
|
|
|
if (bcmp(pack, bigH, sizeof (bufwad_t)) != 0)
|
|
|
|
fatal(0, "pack/bigH mismatch in %p/%p", pack, bigH);
|
|
|
|
|
|
|
|
if (bcmp(pack, bigT, sizeof (bufwad_t)) != 0)
|
|
|
|
fatal(0, "pack/bigT mismatch in %p/%p", pack, bigT);
|
|
|
|
|
|
|
|
if (freeit) {
|
|
|
|
bzero(pack, sizeof (bufwad_t));
|
|
|
|
} else {
|
|
|
|
pack->bw_index = n + i;
|
|
|
|
pack->bw_txg = txg;
|
|
|
|
pack->bw_data = 1 + ztest_random(-2ULL);
|
|
|
|
}
|
|
|
|
*bigH = *pack;
|
|
|
|
*bigT = *pack;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We've verified all the old bufwads, and made new ones.
|
|
|
|
* Now write them out.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_write(os, packobj, packoff, packsize, packbuf, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (freeit) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 7) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("freeing offset %llx size %llx"
|
|
|
|
" txg %llx\n",
|
|
|
|
(u_longlong_t)bigoff,
|
|
|
|
(u_longlong_t)bigsize,
|
|
|
|
(u_longlong_t)txg);
|
|
|
|
}
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_free_range(os, bigobj, bigoff, bigsize, tx));
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 7) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("writing offset %llx size %llx"
|
|
|
|
" txg %llx\n",
|
|
|
|
(u_longlong_t)bigoff,
|
|
|
|
(u_longlong_t)bigsize,
|
|
|
|
(u_longlong_t)txg);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_write(os, bigobj, bigoff, bigsize, bigbuf, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sanity check the stuff we just wrote.
|
|
|
|
*/
|
|
|
|
{
|
|
|
|
void *packcheck = umem_alloc(packsize, UMEM_NOFAIL);
|
|
|
|
void *bigcheck = umem_alloc(bigsize, UMEM_NOFAIL);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_read(os, packobj, packoff,
|
2009-07-03 02:44:48 +04:00
|
|
|
packsize, packcheck, DMU_READ_PREFETCH));
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_read(os, bigobj, bigoff,
|
2009-07-03 02:44:48 +04:00
|
|
|
bigsize, bigcheck, DMU_READ_PREFETCH));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(bcmp(packbuf, packcheck, packsize));
|
|
|
|
ASSERT0(bcmp(bigbuf, bigcheck, bigsize));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
umem_free(packcheck, packsize);
|
|
|
|
umem_free(bigcheck, bigsize);
|
|
|
|
}
|
|
|
|
|
|
|
|
umem_free(packbuf, packsize);
|
|
|
|
umem_free(bigbuf, bigsize);
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(od, size);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2009-07-03 02:44:48 +04:00
|
|
|
compare_and_update_pbbufs(uint64_t s, bufwad_t *packbuf, bufwad_t *bigbuf,
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t bigsize, uint64_t n, uint64_t chunksize, uint64_t txg)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
|
|
|
uint64_t i;
|
|
|
|
bufwad_t *pack;
|
|
|
|
bufwad_t *bigH;
|
|
|
|
bufwad_t *bigT;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For each index from n to n + s, verify that the existing bufwad
|
|
|
|
* in packobj matches the bufwads at the head and tail of the
|
|
|
|
* corresponding chunk in bigobj. Then update all three bufwads
|
|
|
|
* with the new values we want to write out.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < s; i++) {
|
|
|
|
/* LINTED */
|
|
|
|
pack = (bufwad_t *)((char *)packbuf + i * sizeof (bufwad_t));
|
|
|
|
/* LINTED */
|
2010-05-29 00:45:14 +04:00
|
|
|
bigH = (bufwad_t *)((char *)bigbuf + i * chunksize);
|
2009-07-03 02:44:48 +04:00
|
|
|
/* LINTED */
|
2010-05-29 00:45:14 +04:00
|
|
|
bigT = (bufwad_t *)((char *)bigH + chunksize) - 1;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U((uintptr_t)bigH - (uintptr_t)bigbuf, <, bigsize);
|
|
|
|
ASSERT3U((uintptr_t)bigT - (uintptr_t)bigbuf, <, bigsize);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
if (pack->bw_txg > txg)
|
|
|
|
fatal(0, "future leak: got %llx, open txg is %llx",
|
|
|
|
pack->bw_txg, txg);
|
|
|
|
|
|
|
|
if (pack->bw_data != 0 && pack->bw_index != n + i)
|
|
|
|
fatal(0, "wrong index: got %llx, wanted %llx+%llx",
|
|
|
|
pack->bw_index, n, i);
|
|
|
|
|
|
|
|
if (bcmp(pack, bigH, sizeof (bufwad_t)) != 0)
|
|
|
|
fatal(0, "pack/bigH mismatch in %p/%p", pack, bigH);
|
|
|
|
|
|
|
|
if (bcmp(pack, bigT, sizeof (bufwad_t)) != 0)
|
|
|
|
fatal(0, "pack/bigT mismatch in %p/%p", pack, bigT);
|
|
|
|
|
|
|
|
pack->bw_index = n + i;
|
|
|
|
pack->bw_txg = txg;
|
|
|
|
pack->bw_data = 1 + ztest_random(-2ULL);
|
|
|
|
|
|
|
|
*bigH = *pack;
|
|
|
|
*bigT = *pack;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
#undef OD_ARRAY_SIZE
|
2013-11-01 23:26:11 +04:00
|
|
|
#define OD_ARRAY_SIZE 2
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_read_write_zcopy(ztest_ds_t *zd, uint64_t id)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_tx_t *tx;
|
|
|
|
uint64_t i;
|
|
|
|
int error;
|
2010-08-26 22:13:05 +04:00
|
|
|
int size;
|
2009-07-03 02:44:48 +04:00
|
|
|
uint64_t n, s, txg;
|
|
|
|
bufwad_t *packbuf, *bigbuf;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t packobj, packoff, packsize, bigobj, bigoff, bigsize;
|
|
|
|
uint64_t blocksize = ztest_random_blocksize();
|
|
|
|
uint64_t chunksize = blocksize;
|
2009-07-03 02:44:48 +04:00
|
|
|
uint64_t regions = 997;
|
|
|
|
uint64_t stride = 123456789ULL;
|
|
|
|
uint64_t width = 9;
|
|
|
|
dmu_buf_t *bonus_db;
|
|
|
|
arc_buf_t **bigbuf_arcbufs;
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_object_info_t doi;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
size = sizeof (ztest_od_t) * OD_ARRAY_SIZE;
|
2010-08-26 22:13:05 +04:00
|
|
|
od = umem_alloc(size, UMEM_NOFAIL);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* This test uses two objects, packobj and bigobj, that are always
|
|
|
|
* updated together (i.e. in the same tx) so that their contents are
|
|
|
|
* in sync and can be compared. Their contents relate to each other
|
|
|
|
* in a simple way: packobj is a dense array of 'bufwad' structures,
|
|
|
|
* while bigobj is a sparse array of the same bufwads. Specifically,
|
|
|
|
* for any index n, there are three bufwads that should be identical:
|
|
|
|
*
|
|
|
|
* packobj, at offset n * sizeof (bufwad_t)
|
|
|
|
* bigobj, at the head of the nth chunk
|
|
|
|
* bigobj, at the tail of the nth chunk
|
|
|
|
*
|
|
|
|
* The chunk size is set equal to bigobj block size so that
|
2017-09-28 18:49:13 +03:00
|
|
|
* dmu_assign_arcbuf_by_dbuf() can be tested for object updates.
|
2009-07-03 02:44:48 +04:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the directory info. If it's the first time, set things up.
|
|
|
|
*/
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_UINT64_OTHER, blocksize, 0, 0);
|
|
|
|
ztest_od_init(od + 1, id, FTAG, 1, DMU_OT_UINT64_OTHER, 0, 0,
|
|
|
|
chunksize);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, size, B_FALSE) != 0) {
|
|
|
|
umem_free(od, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
bigobj = od[0].od_object;
|
|
|
|
packobj = od[1].od_object;
|
|
|
|
blocksize = od[0].od_blocksize;
|
|
|
|
chunksize = blocksize;
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(chunksize, ==, od[1].od_gen);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_object_info(os, bigobj, &doi));
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY(ISP2(doi.doi_data_block_size));
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3U(chunksize, ==, doi.doi_data_block_size);
|
|
|
|
VERIFY3U(chunksize, >=, 2 * sizeof (bufwad_t));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick a random index and compute the offsets into packobj and bigobj.
|
|
|
|
*/
|
|
|
|
n = ztest_random(regions) * stride + ztest_random(width);
|
|
|
|
s = 1 + ztest_random(width - 1);
|
|
|
|
|
|
|
|
packoff = n * sizeof (bufwad_t);
|
|
|
|
packsize = s * sizeof (bufwad_t);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
bigoff = n * chunksize;
|
|
|
|
bigsize = s * chunksize;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
packbuf = umem_zalloc(packsize, UMEM_NOFAIL);
|
|
|
|
bigbuf = umem_zalloc(bigsize, UMEM_NOFAIL);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_bonus_hold(os, bigobj, FTAG, &bonus_db));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
bigbuf_arcbufs = umem_zalloc(2 * s * sizeof (arc_buf_t *), UMEM_NOFAIL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Iteration 0 test zcopy for DB_UNCACHED dbufs.
|
|
|
|
* Iteration 1 test zcopy to already referenced dbufs.
|
|
|
|
* Iteration 2 test zcopy to dirty dbuf in the same txg.
|
|
|
|
* Iteration 3 test zcopy to dbuf dirty in previous txg.
|
|
|
|
* Iteration 4 test zcopy when dbuf is no longer dirty.
|
|
|
|
* Iteration 5 test zcopy when it can't be done.
|
|
|
|
* Iteration 6 one more zcopy write.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < 7; i++) {
|
|
|
|
uint64_t j;
|
|
|
|
uint64_t off;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In iteration 5 (i == 5) use arcbufs
|
|
|
|
* that don't match bigobj blksz to test
|
2017-09-28 18:49:13 +03:00
|
|
|
* dmu_assign_arcbuf_by_dbuf() when it can't directly
|
2009-07-03 02:44:48 +04:00
|
|
|
* assign an arcbuf to a dbuf.
|
|
|
|
*/
|
|
|
|
for (j = 0; j < s; j++) {
|
2014-12-30 06:12:23 +03:00
|
|
|
if (i != 5 || chunksize < (SPA_MINBLOCKSIZE * 2)) {
|
2009-07-03 02:44:48 +04:00
|
|
|
bigbuf_arcbufs[j] =
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_request_arcbuf(bonus_db, chunksize);
|
2009-07-03 02:44:48 +04:00
|
|
|
} else {
|
|
|
|
bigbuf_arcbufs[2 * j] =
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_request_arcbuf(bonus_db, chunksize / 2);
|
2009-07-03 02:44:48 +04:00
|
|
|
bigbuf_arcbufs[2 * j + 1] =
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_request_arcbuf(bonus_db, chunksize / 2);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get a tx for the mods to both packobj and bigobj.
|
|
|
|
*/
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_write(tx, packobj, packoff, packsize);
|
|
|
|
dmu_tx_hold_write(tx, bigobj, bigoff, bigsize);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
2009-07-03 02:44:48 +04:00
|
|
|
umem_free(packbuf, packsize);
|
|
|
|
umem_free(bigbuf, bigsize);
|
|
|
|
for (j = 0; j < s; j++) {
|
2014-12-30 06:12:23 +03:00
|
|
|
if (i != 5 ||
|
|
|
|
chunksize < (SPA_MINBLOCKSIZE * 2)) {
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_return_arcbuf(bigbuf_arcbufs[j]);
|
|
|
|
} else {
|
|
|
|
dmu_return_arcbuf(
|
|
|
|
bigbuf_arcbufs[2 * j]);
|
|
|
|
dmu_return_arcbuf(
|
|
|
|
bigbuf_arcbufs[2 * j + 1]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
umem_free(bigbuf_arcbufs, 2 * s * sizeof (arc_buf_t *));
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(od, size);
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_buf_rele(bonus_db, FTAG);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 50% of the time don't read objects in the 1st iteration to
|
2017-09-28 18:49:13 +03:00
|
|
|
* test dmu_assign_arcbuf_by_dbuf() for the case when there are
|
|
|
|
* no existing dbufs for the specified offsets.
|
2009-07-03 02:44:48 +04:00
|
|
|
*/
|
|
|
|
if (i != 0 || ztest_random(2) != 0) {
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_read(os, packobj, packoff,
|
2009-07-03 02:44:48 +04:00
|
|
|
packsize, packbuf, DMU_READ_PREFETCH);
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_read(os, bigobj, bigoff, bigsize,
|
2009-07-03 02:44:48 +04:00
|
|
|
bigbuf, DMU_READ_PREFETCH);
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
compare_and_update_pbbufs(s, packbuf, bigbuf, bigsize,
|
2010-05-29 00:45:14 +04:00
|
|
|
n, chunksize, txg);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We've verified all the old bufwads, and made new ones.
|
|
|
|
* Now write them out.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_write(os, packobj, packoff, packsize, packbuf, tx);
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 7) {
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) printf("writing offset %llx size %llx"
|
|
|
|
" txg %llx\n",
|
|
|
|
(u_longlong_t)bigoff,
|
|
|
|
(u_longlong_t)bigsize,
|
|
|
|
(u_longlong_t)txg);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
for (off = bigoff, j = 0; j < s; j++, off += chunksize) {
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_buf_t *dbt;
|
2014-12-30 06:12:23 +03:00
|
|
|
if (i != 5 || chunksize < (SPA_MINBLOCKSIZE * 2)) {
|
2009-07-03 02:44:48 +04:00
|
|
|
bcopy((caddr_t)bigbuf + (off - bigoff),
|
2010-05-29 00:45:14 +04:00
|
|
|
bigbuf_arcbufs[j]->b_data, chunksize);
|
2009-07-03 02:44:48 +04:00
|
|
|
} else {
|
|
|
|
bcopy((caddr_t)bigbuf + (off - bigoff),
|
|
|
|
bigbuf_arcbufs[2 * j]->b_data,
|
2010-05-29 00:45:14 +04:00
|
|
|
chunksize / 2);
|
2009-07-03 02:44:48 +04:00
|
|
|
bcopy((caddr_t)bigbuf + (off - bigoff) +
|
2010-05-29 00:45:14 +04:00
|
|
|
chunksize / 2,
|
2009-07-03 02:44:48 +04:00
|
|
|
bigbuf_arcbufs[2 * j + 1]->b_data,
|
2010-05-29 00:45:14 +04:00
|
|
|
chunksize / 2);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (i == 1) {
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY(dmu_buf_hold(os, bigobj, off,
|
|
|
|
FTAG, &dbt, DMU_READ_NO_PREFETCH) == 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
if (i != 5 || chunksize < (SPA_MINBLOCKSIZE * 2)) {
|
2019-01-18 02:47:08 +03:00
|
|
|
VERIFY0(dmu_assign_arcbuf_by_dbuf(bonus_db,
|
|
|
|
off, bigbuf_arcbufs[j], tx));
|
2009-07-03 02:44:48 +04:00
|
|
|
} else {
|
2019-01-18 02:47:08 +03:00
|
|
|
VERIFY0(dmu_assign_arcbuf_by_dbuf(bonus_db,
|
|
|
|
off, bigbuf_arcbufs[2 * j], tx));
|
|
|
|
VERIFY0(dmu_assign_arcbuf_by_dbuf(bonus_db,
|
2010-05-29 00:45:14 +04:00
|
|
|
off + chunksize / 2,
|
2019-01-18 02:47:08 +03:00
|
|
|
bigbuf_arcbufs[2 * j + 1], tx));
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
if (i == 1) {
|
|
|
|
dmu_buf_rele(dbt, FTAG);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sanity check the stuff we just wrote.
|
|
|
|
*/
|
|
|
|
{
|
|
|
|
void *packcheck = umem_alloc(packsize, UMEM_NOFAIL);
|
|
|
|
void *bigcheck = umem_alloc(bigsize, UMEM_NOFAIL);
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
VERIFY0(dmu_read(os, packobj, packoff,
|
2009-07-03 02:44:48 +04:00
|
|
|
packsize, packcheck, DMU_READ_PREFETCH));
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
VERIFY0(dmu_read(os, bigobj, bigoff,
|
2009-07-03 02:44:48 +04:00
|
|
|
bigsize, bigcheck, DMU_READ_PREFETCH));
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
ASSERT0(bcmp(packbuf, packcheck, packsize));
|
|
|
|
ASSERT0(bcmp(bigbuf, bigcheck, bigsize));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
umem_free(packcheck, packsize);
|
|
|
|
umem_free(bigcheck, bigsize);
|
|
|
|
}
|
|
|
|
if (i == 2) {
|
2019-03-29 19:13:20 +03:00
|
|
|
txg_wait_open(dmu_objset_pool(os), 0, B_TRUE);
|
2009-07-03 02:44:48 +04:00
|
|
|
} else if (i == 3) {
|
|
|
|
txg_wait_synced(dmu_objset_pool(os), 0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_buf_rele(bonus_db, FTAG);
|
|
|
|
umem_free(packbuf, packsize);
|
|
|
|
umem_free(bigbuf, bigsize);
|
|
|
|
umem_free(bigbuf_arcbufs, 2 * s * sizeof (arc_buf_t *));
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(od, size);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_write_parallel(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t offset = (1ULL << (ztest_random(20) + 43)) +
|
|
|
|
(ztest_random(ZTEST_RANGE_LOCKS) << SPA_MAXBLOCKSHIFT);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Have multiple threads write to large offsets in an object
|
|
|
|
* to verify that parallel writes to an object -- even to the
|
|
|
|
* same blocks within the object -- doesn't cause any trouble.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, ID_PARALLEL, FTAG, 0, DMU_OT_UINT64_OTHER, 0, 0, 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t), B_FALSE) != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
while (ztest_random(10) != 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_io(zd, od->od_object, offset);
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
void
|
|
|
|
ztest_dmu_prealloc(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t offset = (1ULL << (ztest_random(4) + SPA_MAXBLOCKSHIFT)) +
|
|
|
|
(ztest_random(ZTEST_RANGE_LOCKS) << SPA_MAXBLOCKSHIFT);
|
|
|
|
uint64_t count = ztest_random(20) + 1;
|
|
|
|
uint64_t blocksize = ztest_random_blocksize();
|
|
|
|
void *data;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_UINT64_OTHER, blocksize, 0, 0);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t),
|
|
|
|
!ztest_random(2)) != 0) {
|
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_truncate(zd, od->od_object, offset, count * blocksize) != 0) {
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_prealloc(zd, od->od_object, offset, count * blocksize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
data = umem_zalloc(blocksize, UMEM_NOFAIL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
while (ztest_random(count) != 0) {
|
|
|
|
uint64_t randoff = offset + (ztest_random(count) * blocksize);
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_write(zd, od->od_object, randoff, blocksize,
|
2010-05-29 00:45:14 +04:00
|
|
|
data) != 0)
|
|
|
|
break;
|
|
|
|
while (ztest_random(4) != 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_io(zd, od->od_object, randoff);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
umem_free(data, blocksize);
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that zap_{create,destroy,add,remove,update} work as expected.
|
|
|
|
*/
|
|
|
|
#define ZTEST_ZAP_MIN_INTS 1
|
|
|
|
#define ZTEST_ZAP_MAX_INTS 4
|
|
|
|
#define ZTEST_ZAP_MAX_PROPS 1000
|
|
|
|
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_zap(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t object;
|
|
|
|
uint64_t txg, last_txg;
|
|
|
|
uint64_t value[ZTEST_ZAP_MAX_INTS];
|
|
|
|
uint64_t zl_ints, zl_intsize, prop;
|
|
|
|
int i, ints;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
char propname[100], txgname[100];
|
|
|
|
int error;
|
|
|
|
char *hc[2] = { "s.acl.h", ".s.open.h.hyLZlg" };
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_ZAP_OTHER, 0, 0, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t),
|
2016-12-12 21:46:26 +03:00
|
|
|
!ztest_random(2)) != 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
object = od->od_object;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Generate a known hash collision, and verify that
|
|
|
|
* we can lookup and remove both entries.
|
|
|
|
*/
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
dmu_tx_hold_zap(tx, object, B_TRUE, NULL);
|
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2010-05-29 00:45:14 +04:00
|
|
|
for (i = 0; i < 2; i++) {
|
|
|
|
value[i] = i;
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_add(os, object, hc[i], sizeof (uint64_t),
|
2010-05-29 00:45:14 +04:00
|
|
|
1, &value[i], tx));
|
|
|
|
}
|
|
|
|
for (i = 0; i < 2; i++) {
|
|
|
|
VERIFY3U(EEXIST, ==, zap_add(os, object, hc[i],
|
|
|
|
sizeof (uint64_t), 1, &value[i], tx));
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(
|
2010-05-29 00:45:14 +04:00
|
|
|
zap_length(os, object, hc[i], &zl_intsize, &zl_ints));
|
|
|
|
ASSERT3U(zl_intsize, ==, sizeof (uint64_t));
|
|
|
|
ASSERT3U(zl_ints, ==, 1);
|
|
|
|
}
|
|
|
|
for (i = 0; i < 2; i++) {
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_remove(os, object, hc[i], tx));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_commit(tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2019-01-13 21:11:52 +03:00
|
|
|
* Generate a bunch of random entries.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
ints = MAX(ZTEST_ZAP_MIN_INTS, object % ZTEST_ZAP_MAX_INTS);
|
|
|
|
|
|
|
|
prop = ztest_random(ZTEST_ZAP_MAX_PROPS);
|
|
|
|
(void) sprintf(propname, "prop_%llu", (u_longlong_t)prop);
|
|
|
|
(void) sprintf(txgname, "txg_%llu", (u_longlong_t)prop);
|
|
|
|
bzero(value, sizeof (value));
|
|
|
|
last_txg = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If these zap entries already exist, validate their contents.
|
|
|
|
*/
|
|
|
|
error = zap_length(os, object, txgname, &zl_intsize, &zl_ints);
|
|
|
|
if (error == 0) {
|
|
|
|
ASSERT3U(zl_intsize, ==, sizeof (uint64_t));
|
|
|
|
ASSERT3U(zl_ints, ==, 1);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_lookup(os, object, txgname, zl_intsize,
|
|
|
|
zl_ints, &last_txg));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_length(os, object, propname, &zl_intsize,
|
|
|
|
&zl_ints));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT3U(zl_intsize, ==, sizeof (uint64_t));
|
|
|
|
ASSERT3U(zl_ints, ==, ints);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_lookup(os, object, propname, zl_intsize,
|
|
|
|
zl_ints, value));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (i = 0; i < ints; i++) {
|
|
|
|
ASSERT3U(value[i], ==, last_txg + object + i);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
ASSERT3U(error, ==, ENOENT);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Atomically update two entries in our zap object.
|
|
|
|
* The first is named txg_%llu, and contains the txg
|
|
|
|
* in which the property was last updated. The second
|
|
|
|
* is named prop_%llu, and the nth element of its value
|
|
|
|
* should be txg + object + n.
|
|
|
|
*/
|
|
|
|
tx = dmu_tx_create(os);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_zap(tx, object, B_TRUE, NULL);
|
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (last_txg > txg)
|
|
|
|
fatal(0, "zap future leak: old %llu new %llu", last_txg, txg);
|
|
|
|
|
|
|
|
for (i = 0; i < ints; i++)
|
|
|
|
value[i] = txg + object + i;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_update(os, object, txgname, sizeof (uint64_t),
|
2010-05-29 00:45:14 +04:00
|
|
|
1, &txg, tx));
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_update(os, object, propname, sizeof (uint64_t),
|
2010-05-29 00:45:14 +04:00
|
|
|
ints, value, tx));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove a random pair of entries.
|
|
|
|
*/
|
|
|
|
prop = ztest_random(ZTEST_ZAP_MAX_PROPS);
|
|
|
|
(void) sprintf(propname, "prop_%llu", (u_longlong_t)prop);
|
|
|
|
(void) sprintf(txgname, "txg_%llu", (u_longlong_t)prop);
|
|
|
|
|
|
|
|
error = zap_length(os, object, txgname, &zl_intsize, &zl_ints);
|
|
|
|
|
|
|
|
if (error == ENOENT)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_zap(tx, object, B_TRUE, NULL);
|
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_remove(os, object, txgname, tx));
|
|
|
|
VERIFY0(zap_remove(os, object, propname, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_commit(tx);
|
2010-08-26 22:13:05 +04:00
|
|
|
out:
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2019-01-13 21:11:52 +03:00
|
|
|
* Test case to test the upgrading of a microzap to fatzap.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_fzap(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
objset_t *os = zd->zd_os;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t object, txg;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_ZAP_OTHER, 0, 0, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t),
|
2016-12-12 21:46:26 +03:00
|
|
|
!ztest_random(2)) != 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
|
|
|
object = od->od_object;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Add entries to this ZAP and make sure it spills over
|
|
|
|
* and gets upgraded to a fatzap. Also, since we are adding
|
|
|
|
* 2050 entries we should see ptrtbl growth and leaf-block split.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < 2050; i++) {
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t value = i;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) snprintf(name, sizeof (name), "fzap-%llu-%llu",
|
2010-08-26 20:52:39 +04:00
|
|
|
(u_longlong_t)id, (u_longlong_t)value);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
dmu_tx_hold_zap(tx, object, B_TRUE, name);
|
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2010-05-29 00:45:14 +04:00
|
|
|
error = zap_add(os, object, name, sizeof (uint64_t), 1,
|
|
|
|
&value, tx);
|
|
|
|
ASSERT(error == 0 || error == EEXIST);
|
|
|
|
dmu_tx_commit(tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-08-26 22:13:05 +04:00
|
|
|
out:
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_zap_parallel(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t txg, object, count, wsize, wc, zl_wsize, zl_wc;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
int i, namelen, error;
|
2010-05-29 00:45:14 +04:00
|
|
|
int micro = ztest_random(2);
|
2008-11-20 23:01:55 +03:00
|
|
|
char name[20], string_value[20];
|
|
|
|
void *data;
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, ID_PARALLEL, FTAG, micro, DMU_OT_ZAP_OTHER, 0, 0, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t), B_FALSE) != 0) {
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
object = od->od_object;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Generate a random name of the form 'xxx.....' where each
|
|
|
|
* x is a random printable character and the dots are dots.
|
|
|
|
* There are 94 such characters, and the name length goes from
|
|
|
|
* 6 to 20, so there are 94^3 * 15 = 12,458,760 possible names.
|
|
|
|
*/
|
|
|
|
namelen = ztest_random(sizeof (name) - 5) + 5 + 1;
|
|
|
|
|
|
|
|
for (i = 0; i < 3; i++)
|
|
|
|
name[i] = '!' + ztest_random('~' - '!' + 1);
|
|
|
|
for (; i < namelen - 1; i++)
|
|
|
|
name[i] = '.';
|
|
|
|
name[i] = '\0';
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((namelen & 1) || micro) {
|
2008-11-20 23:01:55 +03:00
|
|
|
wsize = sizeof (txg);
|
|
|
|
wc = 1;
|
|
|
|
data = &txg;
|
|
|
|
} else {
|
|
|
|
wsize = 1;
|
|
|
|
wc = namelen;
|
|
|
|
data = string_value;
|
|
|
|
}
|
|
|
|
|
|
|
|
count = -1ULL;
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY0(zap_count(os, object, &count));
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(count, !=, -1ULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Select an operation: length, lookup, add, update, remove.
|
|
|
|
*/
|
|
|
|
i = ztest_random(5);
|
|
|
|
|
|
|
|
if (i >= 2) {
|
|
|
|
tx = dmu_tx_create(os);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_zap(tx, object, B_TRUE, NULL);
|
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
2016-07-28 23:10:05 +03:00
|
|
|
if (txg == 0) {
|
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
2016-07-28 23:10:05 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
bcopy(name, string_value, namelen);
|
|
|
|
} else {
|
|
|
|
tx = NULL;
|
|
|
|
txg = 0;
|
|
|
|
bzero(string_value, namelen);
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (i) {
|
|
|
|
|
|
|
|
case 0:
|
|
|
|
error = zap_length(os, object, name, &zl_wsize, &zl_wc);
|
|
|
|
if (error == 0) {
|
|
|
|
ASSERT3U(wsize, ==, zl_wsize);
|
|
|
|
ASSERT3U(wc, ==, zl_wc);
|
|
|
|
} else {
|
|
|
|
ASSERT3U(error, ==, ENOENT);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
case 1:
|
|
|
|
error = zap_lookup(os, object, name, wsize, wc, data);
|
|
|
|
if (error == 0) {
|
|
|
|
if (data == string_value &&
|
|
|
|
bcmp(name, data, namelen) != 0)
|
|
|
|
fatal(0, "name '%s' != val '%s' len %d",
|
|
|
|
name, data, namelen);
|
|
|
|
} else {
|
|
|
|
ASSERT3U(error, ==, ENOENT);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
case 2:
|
|
|
|
error = zap_add(os, object, name, wsize, wc, data, tx);
|
|
|
|
ASSERT(error == 0 || error == EEXIST);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case 3:
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_update(os, object, name, wsize, wc, data, tx));
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
|
|
|
|
case 4:
|
|
|
|
error = zap_remove(os, object, name, tx);
|
|
|
|
ASSERT(error == 0 || error == ENOENT);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (tx != NULL)
|
|
|
|
dmu_tx_commit(tx);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Commit callback data.
|
|
|
|
*/
|
|
|
|
typedef struct ztest_cb_data {
|
|
|
|
list_node_t zcd_node;
|
|
|
|
uint64_t zcd_txg;
|
|
|
|
int zcd_expected_err;
|
|
|
|
boolean_t zcd_added;
|
|
|
|
boolean_t zcd_called;
|
|
|
|
spa_t *zcd_spa;
|
|
|
|
} ztest_cb_data_t;
|
|
|
|
|
|
|
|
/* This is the actual commit callback function */
|
|
|
|
static void
|
|
|
|
ztest_commit_callback(void *arg, int error)
|
|
|
|
{
|
|
|
|
ztest_cb_data_t *data = arg;
|
|
|
|
uint64_t synced_txg;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3P(data, !=, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3S(data->zcd_expected_err, ==, error);
|
|
|
|
VERIFY(!data->zcd_called);
|
|
|
|
|
|
|
|
synced_txg = spa_last_synced_txg(data->zcd_spa);
|
|
|
|
if (data->zcd_txg > synced_txg)
|
|
|
|
fatal(0, "commit callback of txg %" PRIu64 " called prematurely"
|
|
|
|
", last synced txg = %" PRIu64 "\n", data->zcd_txg,
|
|
|
|
synced_txg);
|
|
|
|
|
|
|
|
data->zcd_called = B_TRUE;
|
|
|
|
|
|
|
|
if (error == ECANCELED) {
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(data->zcd_txg);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(!data->zcd_added);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The private callback data should be destroyed here, but
|
|
|
|
* since we are going to check the zcd_called field after
|
|
|
|
* dmu_tx_abort(), we will destroy it there.
|
|
|
|
*/
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-08-26 21:17:18 +04:00
|
|
|
ASSERT(data->zcd_added);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT3U(data->zcd_txg, !=, 0);
|
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) mutex_enter(&zcl.zcl_callbacks_lock);
|
2010-08-26 21:17:18 +04:00
|
|
|
|
|
|
|
/* See if this cb was called more quickly */
|
|
|
|
if ((synced_txg - data->zcd_txg) < zc_min_txg_delay)
|
|
|
|
zc_min_txg_delay = synced_txg - data->zcd_txg;
|
|
|
|
|
|
|
|
/* Remove our callback from the list */
|
2010-05-29 00:45:14 +04:00
|
|
|
list_remove(&zcl.zcl_callbacks, data);
|
2010-08-26 21:17:18 +04:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) mutex_exit(&zcl.zcl_callbacks_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
umem_free(data, sizeof (ztest_cb_data_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Allocate and initialize callback data structure */
|
|
|
|
static ztest_cb_data_t *
|
|
|
|
ztest_create_cb_data(objset_t *os, uint64_t txg)
|
|
|
|
{
|
|
|
|
ztest_cb_data_t *cb_data;
|
|
|
|
|
|
|
|
cb_data = umem_zalloc(sizeof (ztest_cb_data_t), UMEM_NOFAIL);
|
|
|
|
|
|
|
|
cb_data->zcd_txg = txg;
|
|
|
|
cb_data->zcd_spa = dmu_objset_spa(os);
|
2010-08-26 21:43:27 +04:00
|
|
|
list_link_init(&cb_data->zcd_node);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
return (cb_data);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commit callback test.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_commit_callbacks(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
objset_t *os = zd->zd_os;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_t *tx;
|
|
|
|
ztest_cb_data_t *cb_data[3], *tmp_cb;
|
|
|
|
uint64_t old_txg, txg;
|
2010-08-26 21:17:18 +04:00
|
|
|
int i, error = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_UINT64_OTHER, 0, 0, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t), B_FALSE) != 0) {
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
cb_data[0] = ztest_create_cb_data(os, 0);
|
|
|
|
dmu_tx_callback_register(tx, ztest_commit_callback, cb_data[0]);
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
dmu_tx_hold_write(tx, od->od_object, 0, sizeof (uint64_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/* Every once in a while, abort the transaction on purpose */
|
|
|
|
if (ztest_random(100) == 0)
|
|
|
|
error = -1;
|
|
|
|
|
|
|
|
if (!error)
|
|
|
|
error = dmu_tx_assign(tx, TXG_NOWAIT);
|
|
|
|
|
|
|
|
txg = error ? 0 : dmu_tx_get_txg(tx);
|
|
|
|
|
|
|
|
cb_data[0]->zcd_txg = txg;
|
|
|
|
cb_data[1] = ztest_create_cb_data(os, txg);
|
|
|
|
dmu_tx_callback_register(tx, ztest_commit_callback, cb_data[1]);
|
|
|
|
|
|
|
|
if (error) {
|
|
|
|
/*
|
|
|
|
* It's not a strict requirement to call the registered
|
|
|
|
* callbacks from inside dmu_tx_abort(), but that's what
|
|
|
|
* it's supposed to happen in the current implementation
|
|
|
|
* so we will check for that.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < 2; i++) {
|
|
|
|
cb_data[i]->zcd_expected_err = ECANCELED;
|
|
|
|
VERIFY(!cb_data[i]->zcd_called);
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_tx_abort(tx);
|
|
|
|
|
|
|
|
for (i = 0; i < 2; i++) {
|
|
|
|
VERIFY(cb_data[i]->zcd_called);
|
|
|
|
umem_free(cb_data[i], sizeof (ztest_cb_data_t));
|
|
|
|
}
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
cb_data[2] = ztest_create_cb_data(os, txg);
|
|
|
|
dmu_tx_callback_register(tx, ztest_commit_callback, cb_data[2]);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read existing data to make sure there isn't a future leak.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_read(os, od->od_object, 0, sizeof (uint64_t),
|
2010-05-29 00:45:14 +04:00
|
|
|
&old_txg, DMU_READ_PREFETCH));
|
|
|
|
|
|
|
|
if (old_txg > txg)
|
|
|
|
fatal(0, "future leak: got %" PRIu64 ", open txg is %" PRIu64,
|
|
|
|
old_txg, txg);
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
dmu_write(os, od->od_object, 0, sizeof (uint64_t), &txg, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) mutex_enter(&zcl.zcl_callbacks_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Since commit callbacks don't have any ordering requirement and since
|
|
|
|
* it is theoretically possible for a commit callback to be called
|
|
|
|
* after an arbitrary amount of time has elapsed since its txg has been
|
|
|
|
* synced, it is difficult to reliably determine whether a commit
|
|
|
|
* callback hasn't been called due to high load or due to a flawed
|
|
|
|
* implementation.
|
|
|
|
*
|
|
|
|
* In practice, we will assume that if after a certain number of txgs a
|
|
|
|
* commit callback hasn't been called, then most likely there's an
|
|
|
|
* implementation bug..
|
|
|
|
*/
|
|
|
|
tmp_cb = list_head(&zcl.zcl_callbacks);
|
|
|
|
if (tmp_cb != NULL &&
|
2010-08-26 21:17:18 +04:00
|
|
|
tmp_cb->zcd_txg + ZTEST_COMMIT_CB_THRESH < txg) {
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal(0, "Commit callback threshold exceeded, oldest txg: %"
|
|
|
|
PRIu64 ", open txg: %" PRIu64 "\n", tmp_cb->zcd_txg, txg);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Let's find the place to insert our callbacks.
|
|
|
|
*
|
|
|
|
* Even though the list is ordered by txg, it is possible for the
|
|
|
|
* insertion point to not be the end because our txg may already be
|
|
|
|
* quiescing at this point and other callbacks in the open txg
|
|
|
|
* (from other objsets) may have sneaked in.
|
|
|
|
*/
|
|
|
|
tmp_cb = list_tail(&zcl.zcl_callbacks);
|
|
|
|
while (tmp_cb != NULL && tmp_cb->zcd_txg > txg)
|
|
|
|
tmp_cb = list_prev(&zcl.zcl_callbacks, tmp_cb);
|
|
|
|
|
|
|
|
/* Add the 3 callbacks to the list */
|
|
|
|
for (i = 0; i < 3; i++) {
|
|
|
|
if (tmp_cb == NULL)
|
|
|
|
list_insert_head(&zcl.zcl_callbacks, cb_data[i]);
|
|
|
|
else
|
|
|
|
list_insert_after(&zcl.zcl_callbacks, tmp_cb,
|
|
|
|
cb_data[i]);
|
|
|
|
|
|
|
|
cb_data[i]->zcd_added = B_TRUE;
|
|
|
|
VERIFY(!cb_data[i]->zcd_called);
|
|
|
|
|
|
|
|
tmp_cb = cb_data[i];
|
|
|
|
}
|
|
|
|
|
2010-08-26 21:17:18 +04:00
|
|
|
zc_cb_counter += 3;
|
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) mutex_exit(&zcl.zcl_callbacks_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
/*
|
|
|
|
* Visit each object in the dataset. Verify that its properties
|
|
|
|
* are consistent what was stored in the block tag when it was created,
|
|
|
|
* and that its unused bonus buffer space has not been overwritten.
|
|
|
|
*/
|
2017-06-29 20:18:03 +03:00
|
|
|
/* ARGSUSED */
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
void
|
|
|
|
ztest_verify_dnode_bt(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
uint64_t obj;
|
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
for (obj = 0; err == 0; err = dmu_object_next(os, &obj, FALSE, 0)) {
|
|
|
|
ztest_block_tag_t *bt = NULL;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
|
2017-12-07 02:30:15 +03:00
|
|
|
ztest_object_lock(zd, obj, RL_READER);
|
|
|
|
if (dmu_bonus_hold(os, obj, FTAG, &db) != 0) {
|
|
|
|
ztest_object_unlock(zd, obj);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
continue;
|
2017-12-07 02:30:15 +03:00
|
|
|
}
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
|
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
if (doi.doi_bonus_size >= sizeof (*bt))
|
|
|
|
bt = ztest_bt_bonus(db);
|
|
|
|
|
|
|
|
if (bt && bt->bt_magic == BT_MAGIC) {
|
|
|
|
ztest_bt_verify(bt, os, obj, doi.doi_dnodesize,
|
|
|
|
bt->bt_offset, bt->bt_gen, bt->bt_txg,
|
|
|
|
bt->bt_crtxg);
|
|
|
|
ztest_verify_unused_bonus(db, bt, obj, os, bt->bt_gen);
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_buf_rele(db, FTAG);
|
2017-12-07 02:30:15 +03:00
|
|
|
ztest_object_unlock(zd, obj);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_dsl_prop_get_set(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
zfs_prop_t proplist[] = {
|
|
|
|
ZFS_PROP_CHECKSUM,
|
|
|
|
ZFS_PROP_COMPRESSION,
|
|
|
|
ZFS_PROP_COPIES,
|
|
|
|
ZFS_PROP_DEDUP
|
|
|
|
};
|
2010-08-26 20:52:39 +04:00
|
|
|
int p;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (p = 0; p < sizeof (proplist) / sizeof (proplist[0]); p++)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) ztest_dsl_prop_set_uint64(zd->zd_name, proplist[p],
|
|
|
|
ztest_random_dsl_prop(proplist[p]), (int)ztest_random(2));
|
|
|
|
|
2015-01-29 23:45:40 +03:00
|
|
|
VERIFY0(ztest_dsl_prop_set_uint64(zd->zd_name, ZFS_PROP_RECORDSIZE,
|
|
|
|
ztest_random_blocksize(), (int)ztest_random(2)));
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_spa_prop_get_set(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
nvlist_t *props = NULL;
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2019-03-29 19:13:20 +03:00
|
|
|
(void) ztest_spa_prop_set_uint64(ZPOOL_PROP_AUTOTRIM, ztest_random(2));
|
|
|
|
|
2013-05-11 01:17:03 +04:00
|
|
|
VERIFY0(spa_prop_get(ztest_spa, &props));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6)
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_nvlist(props, 4);
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(props);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
static int
|
|
|
|
user_release_one(const char *snapname, const char *holdname)
|
|
|
|
{
|
|
|
|
nvlist_t *snaps, *holds;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
snaps = fnvlist_alloc();
|
|
|
|
holds = fnvlist_alloc();
|
|
|
|
fnvlist_add_boolean(holds, holdname);
|
|
|
|
fnvlist_add_nvlist(snaps, snapname, holds);
|
|
|
|
fnvlist_free(holds);
|
|
|
|
error = dsl_dataset_user_release(snaps, NULL);
|
|
|
|
fnvlist_free(snaps);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Test snapshot hold/release and deferred destroy.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_dmu_snapshot_hold(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int error;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
objset_t *origin;
|
|
|
|
char snapname[100];
|
|
|
|
char fullname[100];
|
|
|
|
char clonename[100];
|
|
|
|
char tag[100];
|
2016-06-16 00:28:36 +03:00
|
|
|
char osname[ZFS_MAX_DATASET_NAME_LEN];
|
2013-09-04 16:00:57 +04:00
|
|
|
nvlist_t *holds;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
dmu_objset_name(os, osname);
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
(void) snprintf(snapname, sizeof (snapname), "sh1_%llu",
|
|
|
|
(u_longlong_t)id);
|
2013-09-04 16:00:57 +04:00
|
|
|
(void) snprintf(fullname, sizeof (fullname), "%s@%s", osname, snapname);
|
|
|
|
(void) snprintf(clonename, sizeof (clonename),
|
2013-11-01 23:26:11 +04:00
|
|
|
"%s/ch1_%llu", osname, (u_longlong_t)id);
|
|
|
|
(void) snprintf(tag, sizeof (tag), "tag_%llu", (u_longlong_t)id);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Clean up from any previous run.
|
|
|
|
*/
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_head(clonename);
|
|
|
|
if (error != ENOENT)
|
|
|
|
ASSERT0(error);
|
|
|
|
error = user_release_one(fullname, tag);
|
|
|
|
if (error != ESRCH && error != ENOENT)
|
|
|
|
ASSERT0(error);
|
|
|
|
error = dsl_destroy_snapshot(fullname, B_FALSE);
|
|
|
|
if (error != ENOENT)
|
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Create snapshot, clone it, mark snap for deferred destroy,
|
|
|
|
* destroy clone, verify snap was also destroyed.
|
|
|
|
*/
|
2013-08-28 15:45:09 +04:00
|
|
|
error = dmu_objset_snapshot_one(osname, snapname);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc("dmu_objset_snapshot");
|
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal(0, "dmu_objset_snapshot(%s) = %d", fullname, error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dmu_objset_clone(clonename, fullname);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error == ENOSPC) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc("dmu_objset_clone");
|
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal(0, "dmu_objset_clone(%s) = %d", clonename, error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(fullname, B_TRUE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
2013-09-04 16:00:57 +04:00
|
|
|
fatal(0, "dsl_destroy_snapshot(%s, B_TRUE) = %d",
|
2010-05-29 00:45:14 +04:00
|
|
|
fullname, error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_head(clonename);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error)
|
2013-09-04 16:00:57 +04:00
|
|
|
fatal(0, "dsl_destroy_head(%s) = %d", clonename, error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_objset_hold(fullname, FTAG, &origin);
|
|
|
|
if (error != ENOENT)
|
|
|
|
fatal(0, "dmu_objset_hold(%s) = %d", fullname, error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Create snapshot, add temporary hold, verify that we can't
|
|
|
|
* destroy a held snapshot, mark for deferred destroy,
|
|
|
|
* release hold, verify snapshot was destroyed.
|
|
|
|
*/
|
2013-08-28 15:45:09 +04:00
|
|
|
error = dmu_objset_snapshot_one(osname, snapname);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc("dmu_objset_snapshot");
|
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal(0, "dmu_objset_snapshot(%s) = %d", fullname, error);
|
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
holds = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(holds, fullname, tag);
|
|
|
|
error = dsl_dataset_user_hold(holds, 0, NULL);
|
|
|
|
fnvlist_free(holds);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc("dsl_dataset_user_hold");
|
|
|
|
goto out;
|
|
|
|
} else if (error) {
|
|
|
|
fatal(0, "dsl_dataset_user_hold(%s, %s) = %u",
|
|
|
|
fullname, tag, error);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(fullname, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != EBUSY) {
|
2013-09-04 16:00:57 +04:00
|
|
|
fatal(0, "dsl_destroy_snapshot(%s, B_FALSE) = %d",
|
2010-05-29 00:45:14 +04:00
|
|
|
fullname, error);
|
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(fullname, B_TRUE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
2013-09-04 16:00:57 +04:00
|
|
|
fatal(0, "dsl_destroy_snapshot(%s, B_TRUE) = %d",
|
2010-05-29 00:45:14 +04:00
|
|
|
fullname, error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = user_release_one(fullname, tag);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error)
|
2013-05-25 06:06:23 +04:00
|
|
|
fatal(0, "user_release_one(%s, %s) = %d", fullname, tag, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY3U(dmu_objset_hold(fullname, FTAG, &origin), ==, ENOENT);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
out:
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inject random faults into the on-disk data.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_fault_inject(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2008-11-20 23:01:55 +03:00
|
|
|
int fd;
|
|
|
|
uint64_t offset;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t leaves;
|
2010-08-26 20:52:39 +04:00
|
|
|
uint64_t bad = 0x1990c0ffeedecadeull;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t top, leaf;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *path0;
|
|
|
|
char *pathrand;
|
2008-11-20 23:01:55 +03:00
|
|
|
size_t fsize;
|
2017-01-26 23:34:29 +03:00
|
|
|
int bshift = SPA_MAXBLOCKSHIFT + 2;
|
2008-11-20 23:01:55 +03:00
|
|
|
int iters = 1000;
|
2010-05-29 00:45:14 +04:00
|
|
|
int maxfaults;
|
|
|
|
int mirror_save;
|
2008-12-03 23:09:06 +03:00
|
|
|
vdev_t *vd0 = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t guid0 = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
boolean_t islog = B_FALSE;
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
path0 = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
pathrand = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Device removal is in progress, fault injection must be disabled
|
|
|
|
* until it completes and the pool is scrubbed. The fault injection
|
|
|
|
* strategy for damaging blocks does not take in to account evacuated
|
|
|
|
* blocks which may have already been damaged.
|
|
|
|
*/
|
|
|
|
if (ztest_device_removal_active) {
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2018-02-05 23:00:26 +03:00
|
|
|
maxfaults = MAXFAULTS(zs);
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
leaves = MAX(zs->zs_mirrors, 1) * ztest_opts.zo_raid_children;
|
2010-05-29 00:45:14 +04:00
|
|
|
mirror_save = zs->zs_mirrors;
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(leaves, >=, 1);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-01-18 20:47:55 +03:00
|
|
|
/*
|
|
|
|
* While ztest is running the number of leaves will not change. This
|
|
|
|
* is critical for the fault injection logic as it determines where
|
|
|
|
* errors can be safely injected such that they are always repairable.
|
|
|
|
*
|
|
|
|
* When restarting ztest a different number of leaves may be requested
|
|
|
|
* which will shift the regions to be damaged. This is fine as long
|
|
|
|
* as the pool has been scrubbed prior to using the new mapping.
|
|
|
|
* Failure to do can result in non-repairable damage being injected.
|
|
|
|
*/
|
|
|
|
if (ztest_pool_scrubbed == B_FALSE)
|
|
|
|
goto out;
|
|
|
|
|
2013-08-07 22:24:34 +04:00
|
|
|
/*
|
|
|
|
* Grab the name lock as reader. There are some operations
|
|
|
|
* which don't like to have their vdevs changed while
|
|
|
|
* they are in progress (i.e. spa_change_guid). Those
|
|
|
|
* operations will have grabbed the name lock as writer.
|
|
|
|
*/
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2013-08-07 22:24:34 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* We need SCL_STATE here because we're going to look at vd0->vdev_tsd.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (ztest_random(2) == 0) {
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Inject errors on a normal data device or slog device.
|
2008-12-03 23:09:06 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
top = ztest_random_vdev_top(spa, B_TRUE);
|
|
|
|
leaf = ztest_random(leaves) + zs->zs_splits;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Generate paths to the first leaf in this top-level vdev,
|
|
|
|
* and to the random leaf we selected. We'll induce transient
|
|
|
|
* write failures and random online/offline activity on leaf 0,
|
|
|
|
* and we'll write random garbage to the randomly chosen leaf.
|
|
|
|
*/
|
Fix ztest vdev file paths.
Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:
snprintf(path, sizeof (path), ztest_dev_template, ...);
This worked fine until 40b84e7aec6392187722e61e5a4a853b530bf60f, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.
As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".
This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
2012-10-01 18:34:52 +04:00
|
|
|
(void) snprintf(path0, MAXPATHLEN, ztest_dev_template,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_dir, ztest_opts.zo_pool,
|
|
|
|
top * leaves + zs->zs_splits);
|
Fix ztest vdev file paths.
Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:
snprintf(path, sizeof (path), ztest_dev_template, ...);
This worked fine until 40b84e7aec6392187722e61e5a4a853b530bf60f, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.
As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".
This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
2012-10-01 18:34:52 +04:00
|
|
|
(void) snprintf(pathrand, MAXPATHLEN, ztest_dev_template,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_dir, ztest_opts.zo_pool,
|
|
|
|
top * leaves + leaf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
vd0 = vdev_lookup_by_path(spa->spa_root_vdev, path0);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (vd0 != NULL && vd0->vdev_top->vdev_islog)
|
|
|
|
islog = B_TRUE;
|
|
|
|
|
2013-08-07 22:24:34 +04:00
|
|
|
/*
|
|
|
|
* If the top-level vdev needs to be resilvered
|
|
|
|
* then we only allow faults on the device that is
|
|
|
|
* resilvering.
|
|
|
|
*/
|
|
|
|
if (vd0 != NULL && maxfaults != 1 &&
|
|
|
|
(!vdev_resilver_needed(vd0->vdev_top, NULL, NULL) ||
|
2013-08-08 00:16:22 +04:00
|
|
|
vd0->vdev_resilver_txg != 0)) {
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Make vd0 explicitly claim to be unreadable,
|
|
|
|
* or unwriteable, or reach behind its back
|
|
|
|
* and close the underlying fd. We can do this if
|
|
|
|
* maxfaults == 0 because we'll fail and reexecute,
|
|
|
|
* and we can do it if maxfaults >= 2 because we'll
|
|
|
|
* have enough redundancy. If maxfaults == 1, the
|
|
|
|
* combination of this with injection of random data
|
|
|
|
* corruption below exceeds the pool's fault tolerance.
|
|
|
|
*/
|
|
|
|
vdev_file_t *vf = vd0->vdev_tsd;
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
zfs_dbgmsg("injecting fault to vdev %llu; maxfaults=%d",
|
|
|
|
(long long)vd0->vdev_id, (int)maxfaults);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (vf != NULL && ztest_random(3) == 0) {
|
2019-11-21 20:32:57 +03:00
|
|
|
(void) close(vf->vf_file->f_fd);
|
|
|
|
vf->vf_file->f_fd = -1;
|
2008-12-03 23:09:06 +03:00
|
|
|
} else if (ztest_random(2) == 0) {
|
|
|
|
vd0->vdev_cant_read = B_TRUE;
|
|
|
|
} else {
|
|
|
|
vd0->vdev_cant_write = B_TRUE;
|
|
|
|
}
|
|
|
|
guid0 = vd0->vdev_guid;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Inject errors on an l2cache device.
|
|
|
|
*/
|
|
|
|
spa_aux_vdev_t *sav = &spa->spa_l2cache;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (sav->sav_count == 0) {
|
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
vd0 = sav->sav_vdevs[ztest_random(sav->sav_count)];
|
2008-11-20 23:01:55 +03:00
|
|
|
guid0 = vd0->vdev_guid;
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) strcpy(path0, vd0->vdev_path);
|
|
|
|
(void) strcpy(pathrand, vd0->vdev_path);
|
|
|
|
|
|
|
|
leaf = 0;
|
|
|
|
leaves = 1;
|
|
|
|
maxfaults = INT_MAX; /* no limit on cache devices */
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* If we can tolerate two or more faults, or we're dealing
|
|
|
|
* with a slog, randomly online/offline vd0.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((maxfaults >= 2 || islog) && guid0 != 0) {
|
2009-01-16 00:59:39 +03:00
|
|
|
if (ztest_random(10) < 6) {
|
|
|
|
int flags = (ztest_random(2) == 0 ?
|
|
|
|
ZFS_OFFLINE_TEMPORARY : 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We have to grab the zs_name_lock as writer to
|
|
|
|
* prevent a race between offlining a slog and
|
|
|
|
* destroying a dataset. Offlining the slog will
|
|
|
|
* grab a reference on the dataset which may cause
|
2013-09-04 16:00:57 +04:00
|
|
|
* dsl_destroy_head() to fail with EBUSY thus
|
2010-05-29 00:45:14 +04:00
|
|
|
* leaving the dataset in an inconsistent state.
|
|
|
|
*/
|
|
|
|
if (islog)
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_wrlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3U(vdev_offline(spa, guid0, flags), !=, EBUSY);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (islog)
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2009-01-16 00:59:39 +03:00
|
|
|
} else {
|
2012-12-22 02:57:09 +04:00
|
|
|
/*
|
|
|
|
* Ideally we would like to be able to randomly
|
|
|
|
* call vdev_[on|off]line without holding locks
|
|
|
|
* to force unpredictable failures but the side
|
|
|
|
* effects of vdev_[on|off]line prevent us from
|
|
|
|
* doing so. We grab the ztest_vdev_lock here to
|
|
|
|
* prevent a race between injection testing and
|
|
|
|
* aux_vdev removal.
|
|
|
|
*/
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
2009-01-16 00:59:39 +03:00
|
|
|
(void) vdev_online(spa, guid0, 0, NULL);
|
2012-12-22 02:57:09 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2009-01-16 00:59:39 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (maxfaults == 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* We have at least single-fault tolerance, so inject data corruption.
|
|
|
|
*/
|
|
|
|
fd = open(pathrand, O_RDWR);
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
if (fd == -1) /* we hit a gap in the device namespace */
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
fsize = lseek(fd, 0, SEEK_END);
|
|
|
|
|
|
|
|
while (--iters != 0) {
|
2016-01-23 04:06:14 +03:00
|
|
|
/*
|
|
|
|
* The offset must be chosen carefully to ensure that
|
|
|
|
* we do not inject a given logical block with errors
|
|
|
|
* on two different leaf devices, because ZFS can not
|
|
|
|
* tolerate that (if maxfaults==1).
|
|
|
|
*
|
2020-06-22 19:48:36 +03:00
|
|
|
* To achieve this we divide each leaf device into
|
|
|
|
* chunks of size (# leaves * SPA_MAXBLOCKSIZE * 4).
|
|
|
|
* Each chunk is further divided into error-injection
|
|
|
|
* ranges (can accept errors) and clear ranges (we do
|
|
|
|
* not inject errors in those). Each error-injection
|
|
|
|
* range can accept errors only for a single leaf vdev.
|
|
|
|
* Error-injection ranges are separated by clear ranges.
|
2016-01-23 04:06:14 +03:00
|
|
|
*
|
|
|
|
* For example, with 3 leaves, each chunk looks like:
|
|
|
|
* 0 to 32M: injection range for leaf 0
|
2020-06-22 19:48:36 +03:00
|
|
|
* 32M to 64M: clear range - no injection allowed
|
2016-01-23 04:06:14 +03:00
|
|
|
* 64M to 96M: injection range for leaf 1
|
2020-06-22 19:48:36 +03:00
|
|
|
* 96M to 128M: clear range - no injection allowed
|
2016-01-23 04:06:14 +03:00
|
|
|
* 128M to 160M: injection range for leaf 2
|
2020-06-22 19:48:36 +03:00
|
|
|
* 160M to 192M: clear range - no injection allowed
|
|
|
|
*
|
|
|
|
* Each clear range must be large enough such that a
|
|
|
|
* single block cannot straddle it. This way a block
|
|
|
|
* can't be a target in two different injection ranges
|
|
|
|
* (on different leaf vdevs).
|
2016-01-23 04:06:14 +03:00
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
offset = ztest_random(fsize / (leaves << bshift)) *
|
|
|
|
(leaves << bshift) + (leaf << bshift) +
|
|
|
|
(ztest_random(1ULL << (bshift - 1)) & -8ULL);
|
|
|
|
|
2017-01-26 23:34:29 +03:00
|
|
|
/*
|
|
|
|
* Only allow damage to the labels at one end of the vdev.
|
|
|
|
*
|
|
|
|
* If all labels are damaged, the device will be totally
|
|
|
|
* inaccessible, which will result in loss of data,
|
|
|
|
* because we also damage (parts of) the other side of
|
|
|
|
* the mirror/raidz.
|
|
|
|
*
|
|
|
|
* Additionally, we will always have both an even and an
|
|
|
|
* odd label, so that we can handle crashes in the
|
|
|
|
* middle of vdev_config_sync().
|
|
|
|
*/
|
|
|
|
if ((leaf & 1) == 0 && offset < VDEV_LABEL_START_SIZE)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The two end labels are stored at the "end" of the disk, but
|
|
|
|
* the end of the disk (vdev_psize) is aligned to
|
|
|
|
* sizeof (vdev_label_t).
|
|
|
|
*/
|
|
|
|
uint64_t psize = P2ALIGN(fsize, sizeof (vdev_label_t));
|
|
|
|
if ((leaf & 1) == 1 &&
|
|
|
|
offset + sizeof (bad) > psize - VDEV_LABEL_END_SIZE)
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (mirror_save != zs->zs_mirrors) {
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) close(fd);
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (pwrite(fd, &bad, sizeof (bad), offset) != sizeof (bad))
|
|
|
|
fatal(1, "can't inject bad word at 0x%llx in %s",
|
|
|
|
offset, pathrand);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 7)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("injected bad word into %s,"
|
|
|
|
" offset 0x%llx\n", pathrand, (u_longlong_t)offset);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
(void) close(fd);
|
2010-08-26 22:13:05 +04:00
|
|
|
out:
|
|
|
|
umem_free(path0, MAXPATHLEN);
|
|
|
|
umem_free(pathrand, MAXPATHLEN);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2019-01-18 20:47:55 +03:00
|
|
|
/*
|
|
|
|
* By design ztest will never inject uncorrectable damage in to the pool.
|
|
|
|
* Issue a scrub, wait for it to complete, and verify there is never any
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
* persistent damage.
|
2019-01-18 20:47:55 +03:00
|
|
|
*
|
|
|
|
* Only after a full scrub has been completed is it safe to start injecting
|
|
|
|
* data corruption. See the comment in zfs_fault_inject().
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
ztest_scrub_impl(spa_t *spa)
|
|
|
|
{
|
|
|
|
int error = spa_scan(spa, POOL_SCAN_SCRUB);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
while (dsl_scan_scrubbing(spa_get_dsl(spa)))
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
|
|
|
if (spa_get_errlog_size(spa) > 0)
|
|
|
|
return (ECKSUM);
|
|
|
|
|
|
|
|
ztest_pool_scrubbed = B_TRUE;
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Scrub the pool.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_scrub(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2019-01-18 20:47:55 +03:00
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
/*
|
|
|
|
* Scrub in progress by device removal.
|
|
|
|
*/
|
|
|
|
if (ztest_device_removal_active)
|
|
|
|
return;
|
|
|
|
|
2019-01-18 20:47:55 +03:00
|
|
|
/*
|
|
|
|
* Start a scrub, wait a moment, then force a restart.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) spa_scan(spa, POOL_SCAN_SCRUB);
|
2019-01-18 20:47:55 +03:00
|
|
|
(void) poll(NULL, 0, 100);
|
|
|
|
|
|
|
|
error = ztest_scrub_impl(spa);
|
|
|
|
if (error == EBUSY)
|
|
|
|
error = 0;
|
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2011-11-12 02:07:54 +04:00
|
|
|
/*
|
|
|
|
* Change the guid for the pool.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_reguid(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2011-11-12 02:07:54 +04:00
|
|
|
uint64_t orig, load;
|
2012-12-15 00:38:04 +04:00
|
|
|
int error;
|
2011-11-12 02:07:54 +04:00
|
|
|
|
2017-09-23 19:28:18 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2011-11-12 02:07:54 +04:00
|
|
|
orig = spa_guid(spa);
|
|
|
|
load = spa_load_guid(spa);
|
2012-12-15 00:38:04 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_wrlock(&ztest_name_lock);
|
2012-12-15 00:38:04 +04:00
|
|
|
error = spa_change_guid(spa);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2012-12-15 00:38:04 +04:00
|
|
|
|
|
|
|
if (error != 0)
|
2011-11-12 02:07:54 +04:00
|
|
|
return;
|
|
|
|
|
2012-12-15 04:28:49 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
2011-11-12 02:07:54 +04:00
|
|
|
(void) printf("Changed guid old %llu -> %llu\n",
|
|
|
|
(u_longlong_t)orig, (u_longlong_t)spa_guid(spa));
|
|
|
|
}
|
|
|
|
|
|
|
|
VERIFY3U(orig, !=, spa_guid(spa));
|
|
|
|
VERIFY3U(load, ==, spa_load_guid(spa));
|
|
|
|
}
|
|
|
|
|
2015-12-10 02:34:16 +03:00
|
|
|
void
|
|
|
|
ztest_fletcher(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
hrtime_t end = gethrtime() + NANOSEC;
|
|
|
|
|
|
|
|
while (gethrtime() <= end) {
|
|
|
|
int run_count = 100;
|
|
|
|
void *buf;
|
2017-02-01 20:34:22 +03:00
|
|
|
struct abd *abd_data, *abd_meta;
|
2015-12-10 02:34:16 +03:00
|
|
|
uint32_t size;
|
|
|
|
int *ptr;
|
|
|
|
int i;
|
|
|
|
zio_cksum_t zc_ref;
|
|
|
|
zio_cksum_t zc_ref_byteswap;
|
|
|
|
|
|
|
|
size = ztest_random_blocksize();
|
2017-02-01 20:34:22 +03:00
|
|
|
|
2015-12-10 02:34:16 +03:00
|
|
|
buf = umem_alloc(size, UMEM_NOFAIL);
|
2017-02-01 20:34:22 +03:00
|
|
|
abd_data = abd_alloc(size, B_FALSE);
|
|
|
|
abd_meta = abd_alloc(size, B_TRUE);
|
2015-12-10 02:34:16 +03:00
|
|
|
|
|
|
|
for (i = 0, ptr = buf; i < size / sizeof (*ptr); i++, ptr++)
|
|
|
|
*ptr = ztest_random(UINT_MAX);
|
|
|
|
|
2017-02-01 20:34:22 +03:00
|
|
|
abd_copy_from_buf_off(abd_data, buf, 0, size);
|
|
|
|
abd_copy_from_buf_off(abd_meta, buf, 0, size);
|
|
|
|
|
2015-12-10 02:34:16 +03:00
|
|
|
VERIFY0(fletcher_4_impl_set("scalar"));
|
2016-06-16 01:47:05 +03:00
|
|
|
fletcher_4_native(buf, size, NULL, &zc_ref);
|
|
|
|
fletcher_4_byteswap(buf, size, NULL, &zc_ref_byteswap);
|
2015-12-10 02:34:16 +03:00
|
|
|
|
|
|
|
VERIFY0(fletcher_4_impl_set("cycle"));
|
|
|
|
while (run_count-- > 0) {
|
|
|
|
zio_cksum_t zc;
|
|
|
|
zio_cksum_t zc_byteswap;
|
|
|
|
|
2016-06-16 01:47:05 +03:00
|
|
|
fletcher_4_byteswap(buf, size, NULL, &zc_byteswap);
|
|
|
|
fletcher_4_native(buf, size, NULL, &zc);
|
2015-12-10 02:34:16 +03:00
|
|
|
|
|
|
|
VERIFY0(bcmp(&zc, &zc_ref, sizeof (zc)));
|
|
|
|
VERIFY0(bcmp(&zc_byteswap, &zc_ref_byteswap,
|
|
|
|
sizeof (zc_byteswap)));
|
2017-02-01 20:34:22 +03:00
|
|
|
|
|
|
|
/* Test ABD - data */
|
|
|
|
abd_fletcher_4_byteswap(abd_data, size, NULL,
|
|
|
|
&zc_byteswap);
|
|
|
|
abd_fletcher_4_native(abd_data, size, NULL, &zc);
|
|
|
|
|
|
|
|
VERIFY0(bcmp(&zc, &zc_ref, sizeof (zc)));
|
|
|
|
VERIFY0(bcmp(&zc_byteswap, &zc_ref_byteswap,
|
|
|
|
sizeof (zc_byteswap)));
|
|
|
|
|
|
|
|
/* Test ABD - metadata */
|
|
|
|
abd_fletcher_4_byteswap(abd_meta, size, NULL,
|
|
|
|
&zc_byteswap);
|
|
|
|
abd_fletcher_4_native(abd_meta, size, NULL, &zc);
|
|
|
|
|
|
|
|
VERIFY0(bcmp(&zc, &zc_ref, sizeof (zc)));
|
|
|
|
VERIFY0(bcmp(&zc_byteswap, &zc_ref_byteswap,
|
|
|
|
sizeof (zc_byteswap)));
|
|
|
|
|
2015-12-10 02:34:16 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
umem_free(buf, size);
|
2017-02-01 20:34:22 +03:00
|
|
|
abd_free(abd_data);
|
|
|
|
abd_free(abd_meta);
|
2015-12-10 02:34:16 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-09-23 04:52:29 +03:00
|
|
|
void
|
|
|
|
ztest_fletcher_incr(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
void *buf;
|
|
|
|
size_t size;
|
|
|
|
int *ptr;
|
|
|
|
int i;
|
|
|
|
zio_cksum_t zc_ref;
|
|
|
|
zio_cksum_t zc_ref_bswap;
|
|
|
|
|
|
|
|
hrtime_t end = gethrtime() + NANOSEC;
|
|
|
|
|
|
|
|
while (gethrtime() <= end) {
|
|
|
|
int run_count = 100;
|
|
|
|
|
|
|
|
size = ztest_random_blocksize();
|
|
|
|
buf = umem_alloc(size, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
for (i = 0, ptr = buf; i < size / sizeof (*ptr); i++, ptr++)
|
|
|
|
*ptr = ztest_random(UINT_MAX);
|
|
|
|
|
|
|
|
VERIFY0(fletcher_4_impl_set("scalar"));
|
|
|
|
fletcher_4_native(buf, size, NULL, &zc_ref);
|
|
|
|
fletcher_4_byteswap(buf, size, NULL, &zc_ref_bswap);
|
|
|
|
|
|
|
|
VERIFY0(fletcher_4_impl_set("cycle"));
|
|
|
|
|
|
|
|
while (run_count-- > 0) {
|
|
|
|
zio_cksum_t zc;
|
|
|
|
zio_cksum_t zc_bswap;
|
|
|
|
size_t pos = 0;
|
|
|
|
|
|
|
|
ZIO_SET_CHECKSUM(&zc, 0, 0, 0, 0);
|
|
|
|
ZIO_SET_CHECKSUM(&zc_bswap, 0, 0, 0, 0);
|
|
|
|
|
|
|
|
while (pos < size) {
|
|
|
|
size_t inc = 64 * ztest_random(size / 67);
|
|
|
|
/* sometimes add few bytes to test non-simd */
|
|
|
|
if (ztest_random(100) < 10)
|
|
|
|
inc += P2ALIGN(ztest_random(64),
|
|
|
|
sizeof (uint32_t));
|
|
|
|
|
|
|
|
if (inc > (size - pos))
|
|
|
|
inc = size - pos;
|
|
|
|
|
|
|
|
fletcher_4_incremental_native(buf + pos, inc,
|
|
|
|
&zc);
|
|
|
|
fletcher_4_incremental_byteswap(buf + pos, inc,
|
|
|
|
&zc_bswap);
|
|
|
|
|
|
|
|
pos += inc;
|
|
|
|
}
|
|
|
|
|
|
|
|
VERIFY3U(pos, ==, size);
|
|
|
|
|
|
|
|
VERIFY(ZIO_CHECKSUM_EQUAL(zc, zc_ref));
|
|
|
|
VERIFY(ZIO_CHECKSUM_EQUAL(zc_bswap, zc_ref_bswap));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* verify if incremental on the whole buffer is
|
|
|
|
* equivalent to non-incremental version
|
|
|
|
*/
|
|
|
|
ZIO_SET_CHECKSUM(&zc, 0, 0, 0, 0);
|
|
|
|
ZIO_SET_CHECKSUM(&zc_bswap, 0, 0, 0, 0);
|
|
|
|
|
|
|
|
fletcher_4_incremental_native(buf, size, &zc);
|
|
|
|
fletcher_4_incremental_byteswap(buf, size, &zc_bswap);
|
|
|
|
|
|
|
|
VERIFY(ZIO_CHECKSUM_EQUAL(zc, zc_ref));
|
|
|
|
VERIFY(ZIO_CHECKSUM_EQUAL(zc_bswap, zc_ref_bswap));
|
|
|
|
}
|
|
|
|
|
|
|
|
umem_free(buf, size);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-02-18 14:20:09 +03:00
|
|
|
static int
|
|
|
|
ztest_set_global_vars(void)
|
|
|
|
{
|
|
|
|
for (size_t i = 0; i < ztest_opts.zo_gvars_count; i++) {
|
|
|
|
char *kv = ztest_opts.zo_gvars[i];
|
|
|
|
VERIFY3U(strlen(kv), <=, ZO_GVARS_MAX_ARGLEN);
|
|
|
|
VERIFY3U(strlen(kv), >, 0);
|
|
|
|
int err = set_global_var(kv);
|
|
|
|
if (ztest_opts.zo_verbose > 0) {
|
|
|
|
(void) printf("setting global var %s ... %s\n", kv,
|
|
|
|
err ? "failed" : "ok");
|
|
|
|
}
|
|
|
|
if (err != 0) {
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"failed to set global var '%s'\n", kv);
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static char **
|
|
|
|
ztest_global_vars_to_zdb_args(void)
|
|
|
|
{
|
|
|
|
char **args = calloc(2*ztest_opts.zo_gvars_count + 1, sizeof (char *));
|
|
|
|
char **cur = args;
|
|
|
|
for (size_t i = 0; i < ztest_opts.zo_gvars_count; i++) {
|
|
|
|
char *kv = ztest_opts.zo_gvars[i];
|
|
|
|
*cur = "-o";
|
|
|
|
cur++;
|
|
|
|
*cur = strdup(kv);
|
|
|
|
cur++;
|
|
|
|
}
|
|
|
|
ASSERT3P(cur, ==, &args[2*ztest_opts.zo_gvars_count]);
|
|
|
|
*cur = NULL;
|
|
|
|
return (args);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The end of strings is indicated by a NULL element */
|
|
|
|
static char *
|
|
|
|
join_strings(char **strings, const char *sep)
|
|
|
|
{
|
|
|
|
size_t totallen = 0;
|
|
|
|
for (char **sp = strings; *sp != NULL; sp++) {
|
|
|
|
totallen += strlen(*sp);
|
|
|
|
totallen += strlen(sep);
|
|
|
|
}
|
|
|
|
if (totallen > 0) {
|
|
|
|
ASSERT(totallen >= strlen(sep));
|
|
|
|
totallen -= strlen(sep);
|
|
|
|
}
|
|
|
|
|
|
|
|
size_t buflen = totallen + 1;
|
|
|
|
char *o = malloc(buflen); /* trailing 0 byte */
|
|
|
|
o[0] = '\0';
|
|
|
|
for (char **sp = strings; *sp != NULL; sp++) {
|
|
|
|
size_t would;
|
|
|
|
would = strlcat(o, *sp, buflen);
|
|
|
|
VERIFY3U(would, <, buflen);
|
|
|
|
if (*(sp+1) == NULL) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
would = strlcat(o, sep, buflen);
|
|
|
|
VERIFY3U(would, <, buflen);
|
|
|
|
}
|
|
|
|
ASSERT3S(strlen(o), ==, totallen);
|
|
|
|
return (o);
|
|
|
|
}
|
|
|
|
|
2015-11-21 02:50:06 +03:00
|
|
|
static int
|
|
|
|
ztest_check_path(char *path)
|
|
|
|
{
|
|
|
|
struct stat s;
|
|
|
|
/* return true on success */
|
|
|
|
return (!stat(path, &s));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_get_zdb_bin(char *bin, int len)
|
|
|
|
{
|
|
|
|
char *zdb_path;
|
|
|
|
/*
|
|
|
|
* Try to use ZDB_PATH and in-tree zdb path. If not successful, just
|
|
|
|
* let popen to search through PATH.
|
|
|
|
*/
|
|
|
|
if ((zdb_path = getenv("ZDB_PATH"))) {
|
|
|
|
strlcpy(bin, zdb_path, len); /* In env */
|
|
|
|
if (!ztest_check_path(bin)) {
|
|
|
|
ztest_dump_core = 0;
|
|
|
|
fatal(1, "invalid ZDB_PATH '%s'", bin);
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3P(realpath(getexecname(), bin), !=, NULL);
|
2015-11-21 02:50:06 +03:00
|
|
|
if (strstr(bin, "/ztest/")) {
|
|
|
|
strstr(bin, "/ztest/")[0] = '\0'; /* In-tree */
|
|
|
|
strcat(bin, "/zdb/zdb");
|
|
|
|
if (ztest_check_path(bin))
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
strcpy(bin, "zdb");
|
|
|
|
}
|
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
static vdev_t *
|
|
|
|
ztest_random_concrete_vdev_leaf(vdev_t *vd)
|
|
|
|
{
|
|
|
|
if (vd == NULL)
|
|
|
|
return (NULL);
|
|
|
|
|
|
|
|
if (vd->vdev_children == 0)
|
|
|
|
return (vd);
|
|
|
|
|
|
|
|
vdev_t *eligible[vd->vdev_children];
|
|
|
|
int eligible_idx = 0, i;
|
|
|
|
for (i = 0; i < vd->vdev_children; i++) {
|
|
|
|
vdev_t *cvd = vd->vdev_child[i];
|
|
|
|
if (cvd->vdev_top->vdev_removing)
|
|
|
|
continue;
|
|
|
|
if (cvd->vdev_children > 0 ||
|
|
|
|
(vdev_is_concrete(cvd) && !cvd->vdev_detached)) {
|
|
|
|
eligible[eligible_idx++] = cvd;
|
|
|
|
}
|
|
|
|
}
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3S(eligible_idx, >, 0);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
|
|
|
|
uint64_t child_no = ztest_random(eligible_idx);
|
|
|
|
return (ztest_random_concrete_vdev_leaf(eligible[child_no]));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_initialize(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
|
|
|
/* Random leaf vdev */
|
|
|
|
vdev_t *rand_vd = ztest_random_concrete_vdev_leaf(spa->spa_root_vdev);
|
|
|
|
if (rand_vd == NULL) {
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The random vdev we've selected may change as soon as we
|
|
|
|
* drop the spa_config_lock. We create local copies of things
|
|
|
|
* we're interested in.
|
|
|
|
*/
|
|
|
|
uint64_t guid = rand_vd->vdev_guid;
|
|
|
|
char *path = strdup(rand_vd->vdev_path);
|
|
|
|
boolean_t active = rand_vd->vdev_initialize_thread != NULL;
|
|
|
|
|
2019-04-05 04:57:06 +03:00
|
|
|
zfs_dbgmsg("vd %px, guid %llu", rand_vd, guid);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
uint64_t cmd = ztest_random(POOL_INITIALIZE_FUNCS);
|
2018-12-19 19:20:39 +03:00
|
|
|
|
|
|
|
nvlist_t *vdev_guids = fnvlist_alloc();
|
|
|
|
nvlist_t *vdev_errlist = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64(vdev_guids, path, guid);
|
|
|
|
error = spa_vdev_initialize(spa, vdev_guids, cmd, vdev_errlist);
|
|
|
|
fnvlist_free(vdev_guids);
|
|
|
|
fnvlist_free(vdev_errlist);
|
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
switch (cmd) {
|
|
|
|
case POOL_INITIALIZE_CANCEL:
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Cancel initialize %s", path);
|
|
|
|
if (!active)
|
|
|
|
(void) printf(" failed (no initialize active)");
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
2019-03-29 19:13:20 +03:00
|
|
|
case POOL_INITIALIZE_START:
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Start initialize %s", path);
|
|
|
|
if (active && error == 0)
|
|
|
|
(void) printf(" failed (already active)");
|
|
|
|
else if (error != 0)
|
|
|
|
(void) printf(" failed (error %d)", error);
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case POOL_INITIALIZE_SUSPEND:
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Suspend initialize %s", path);
|
|
|
|
if (!active)
|
|
|
|
(void) printf(" failed (no initialize active)");
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
free(path);
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
}
|
|
|
|
|
2019-03-29 19:13:20 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
ztest_trim(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
|
|
|
/* Random leaf vdev */
|
|
|
|
vdev_t *rand_vd = ztest_random_concrete_vdev_leaf(spa->spa_root_vdev);
|
|
|
|
if (rand_vd == NULL) {
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The random vdev we've selected may change as soon as we
|
|
|
|
* drop the spa_config_lock. We create local copies of things
|
|
|
|
* we're interested in.
|
|
|
|
*/
|
|
|
|
uint64_t guid = rand_vd->vdev_guid;
|
|
|
|
char *path = strdup(rand_vd->vdev_path);
|
|
|
|
boolean_t active = rand_vd->vdev_trim_thread != NULL;
|
|
|
|
|
|
|
|
zfs_dbgmsg("vd %p, guid %llu", rand_vd, guid);
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
uint64_t cmd = ztest_random(POOL_TRIM_FUNCS);
|
|
|
|
uint64_t rate = 1 << ztest_random(30);
|
|
|
|
boolean_t partial = (ztest_random(5) > 0);
|
|
|
|
boolean_t secure = (ztest_random(5) > 0);
|
|
|
|
|
|
|
|
nvlist_t *vdev_guids = fnvlist_alloc();
|
|
|
|
nvlist_t *vdev_errlist = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64(vdev_guids, path, guid);
|
|
|
|
error = spa_vdev_trim(spa, vdev_guids, cmd, rate, partial,
|
|
|
|
secure, vdev_errlist);
|
|
|
|
fnvlist_free(vdev_guids);
|
|
|
|
fnvlist_free(vdev_errlist);
|
|
|
|
|
|
|
|
switch (cmd) {
|
|
|
|
case POOL_TRIM_CANCEL:
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Cancel TRIM %s", path);
|
|
|
|
if (!active)
|
|
|
|
(void) printf(" failed (no TRIM active)");
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case POOL_TRIM_START:
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Start TRIM %s", path);
|
|
|
|
if (active && error == 0)
|
|
|
|
(void) printf(" failed (already active)");
|
|
|
|
else if (error != 0)
|
|
|
|
(void) printf(" failed (error %d)", error);
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case POOL_TRIM_SUSPEND:
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Suspend TRIM %s", path);
|
|
|
|
if (!active)
|
|
|
|
(void) printf(" failed (no TRIM active)");
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
free(path);
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Verify pool integrity by running zdb.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_run_zdb(char *pool)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int status;
|
|
|
|
char *bin;
|
2010-08-26 22:59:11 +04:00
|
|
|
char *zdb;
|
|
|
|
char *zbuf;
|
2015-11-21 02:50:06 +03:00
|
|
|
const int len = MAXPATHLEN + MAXNAMELEN + 20;
|
2008-11-20 23:01:55 +03:00
|
|
|
FILE *fp;
|
|
|
|
|
2015-11-21 02:50:06 +03:00
|
|
|
bin = umem_alloc(len, UMEM_NOFAIL);
|
|
|
|
zdb = umem_alloc(len, UMEM_NOFAIL);
|
2010-08-26 22:59:11 +04:00
|
|
|
zbuf = umem_alloc(1024, UMEM_NOFAIL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-11-21 02:50:06 +03:00
|
|
|
ztest_get_zdb_bin(bin, len);
|
2010-08-26 22:59:11 +04:00
|
|
|
|
2021-02-18 14:20:09 +03:00
|
|
|
char **set_gvars_args = ztest_global_vars_to_zdb_args();
|
|
|
|
char *set_gvars_args_joined = join_strings(set_gvars_args, " ");
|
|
|
|
free(set_gvars_args);
|
|
|
|
|
|
|
|
size_t would = snprintf(zdb, len,
|
|
|
|
"%s -bcc%s%s -G -d -Y -e -y %s -p %s %s",
|
2010-08-26 22:59:11 +04:00
|
|
|
bin,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_verbose >= 3 ? "s" : "",
|
|
|
|
ztest_opts.zo_verbose >= 4 ? "v" : "",
|
2021-02-18 14:20:09 +03:00
|
|
|
set_gvars_args_joined,
|
2020-05-30 07:14:10 +03:00
|
|
|
ztest_opts.zo_dir,
|
2008-12-03 23:09:06 +03:00
|
|
|
pool);
|
2021-02-18 14:20:09 +03:00
|
|
|
ASSERT3U(would, <, len);
|
|
|
|
|
|
|
|
free(set_gvars_args_joined);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5)
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("Executing %s\n", strstr(zdb, "zdb "));
|
|
|
|
|
|
|
|
fp = popen(zdb, "r");
|
|
|
|
|
Fix ztest vdev file paths.
Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:
snprintf(path, sizeof (path), ztest_dev_template, ...);
This worked fine until 40b84e7aec6392187722e61e5a4a853b530bf60f, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.
As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".
This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
2012-10-01 18:34:52 +04:00
|
|
|
while (fgets(zbuf, 1024, fp) != NULL)
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("%s", zbuf);
|
|
|
|
|
|
|
|
status = pclose(fp);
|
|
|
|
|
|
|
|
if (status == 0)
|
2010-08-26 22:59:11 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ztest_dump_core = 0;
|
|
|
|
if (WIFEXITED(status))
|
|
|
|
fatal(0, "'%s' exit code %d", zdb, WEXITSTATUS(status));
|
|
|
|
else
|
|
|
|
fatal(0, "'%s' died with signal %d", zdb, WTERMSIG(status));
|
2010-08-26 22:59:11 +04:00
|
|
|
out:
|
2015-11-21 02:50:06 +03:00
|
|
|
umem_free(bin, len);
|
|
|
|
umem_free(zdb, len);
|
2010-08-26 22:59:11 +04:00
|
|
|
umem_free(zbuf, 1024);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_walk_pool_directory(char *header)
|
|
|
|
{
|
|
|
|
spa_t *spa = NULL;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6)
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("%s\n", header);
|
|
|
|
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
while ((spa = spa_next(spa)) != NULL)
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6)
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\t%s\n", spa_name(spa));
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_spa_import_export(char *oldname, char *newname)
|
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
nvlist_t *config, *newconfig;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t pool_guid;
|
|
|
|
spa_t *spa;
|
2013-09-04 16:00:57 +04:00
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("import/export: old = %s, new = %s\n",
|
|
|
|
oldname, newname);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clean up from previous runs.
|
|
|
|
*/
|
|
|
|
(void) spa_destroy(newname);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get the pool's configuration and guid.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(oldname, &spa, FTAG));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
/*
|
|
|
|
* Kick off a scrub to tickle scrub/export races.
|
|
|
|
*/
|
|
|
|
if (ztest_random(2) == 0)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) spa_scan(spa, POOL_SCAN_SCRUB);
|
2009-01-16 00:59:39 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
pool_guid = spa_guid(spa);
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
ztest_walk_pool_directory("pools before export");
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Export it.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_export(oldname, &config, B_FALSE, B_FALSE));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ztest_walk_pool_directory("pools after export");
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
/*
|
|
|
|
* Try to import it.
|
|
|
|
*/
|
|
|
|
newconfig = spa_tryimport(config);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(newconfig, !=, NULL);
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(newconfig);
|
2009-01-16 00:59:39 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Import it under the new name.
|
|
|
|
*/
|
2013-09-04 16:00:57 +04:00
|
|
|
error = spa_import(newname, config, NULL, 0);
|
|
|
|
if (error != 0) {
|
|
|
|
dump_nvlist(config, 0);
|
|
|
|
fatal(B_FALSE, "couldn't import pool %s as %s: error %u",
|
|
|
|
oldname, newname, error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ztest_walk_pool_directory("pools after import");
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to import it again -- should fail with EEXIST.
|
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
VERIFY3U(EEXIST, ==, spa_import(newname, config, NULL, 0));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to import it under a different name -- should fail with EEXIST.
|
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
VERIFY3U(EEXIST, ==, spa_import(oldname, config, NULL, 0));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that the pool is no longer visible under the old name.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(ENOENT, ==, spa_open(oldname, &spa, FTAG));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can open and close the pool using the new name.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(newname, &spa, FTAG));
|
|
|
|
ASSERT3U(pool_guid, ==, spa_guid(spa));
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(config);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
static void
|
|
|
|
ztest_resume(spa_t *spa)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
if (spa_suspended(spa) && ztest_opts.zo_verbose >= 6)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("resuming from suspended state\n");
|
|
|
|
spa_vdev_state_enter(spa, SCL_NONE);
|
|
|
|
vdev_clear(spa, NULL);
|
|
|
|
(void) spa_vdev_state_exit(spa, NULL, 0);
|
|
|
|
(void) zio_resume(spa);
|
2009-01-16 00:59:39 +03:00
|
|
|
}
|
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
static void
|
2009-01-16 00:59:39 +03:00
|
|
|
ztest_resume_thread(void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_t *spa = arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
while (!ztest_exiting) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (spa_suspended(spa))
|
|
|
|
ztest_resume(spa);
|
|
|
|
(void) poll(NULL, 0, 100);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Periodically change the zfs_compressed_arc_enabled setting.
|
|
|
|
*/
|
|
|
|
if (ztest_random(10) == 0)
|
|
|
|
zfs_compressed_arc_enabled = ztest_random(2);
|
2016-07-22 18:52:49 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Periodically change the zfs_abd_scatter_enabled setting.
|
|
|
|
*/
|
|
|
|
if (ztest_random(10) == 0)
|
|
|
|
zfs_abd_scatter_enabled = ztest_random(2);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
thread_exit();
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
static void
|
2017-12-19 01:06:07 +03:00
|
|
|
ztest_deadman_thread(void *arg)
|
2010-08-26 21:43:27 +04:00
|
|
|
{
|
2017-12-19 01:06:07 +03:00
|
|
|
ztest_shared_t *zs = arg;
|
|
|
|
spa_t *spa = ztest_spa;
|
2018-10-10 23:48:33 +03:00
|
|
|
hrtime_t delay, overdue, last_run = gethrtime();
|
2017-12-19 01:06:07 +03:00
|
|
|
|
2018-10-10 23:48:33 +03:00
|
|
|
delay = (zs->zs_thread_stop - zs->zs_thread_start) +
|
|
|
|
MSEC2NSEC(zfs_deadman_synctime_ms);
|
2017-12-19 01:06:07 +03:00
|
|
|
|
2018-10-10 23:48:33 +03:00
|
|
|
while (!ztest_exiting) {
|
|
|
|
/*
|
|
|
|
* Wait for the delay timer while checking occasionally
|
|
|
|
* if we should stop.
|
|
|
|
*/
|
|
|
|
if (gethrtime() < last_run + delay) {
|
|
|
|
(void) poll(NULL, 0, 1000);
|
|
|
|
continue;
|
|
|
|
}
|
2017-12-19 01:06:07 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the pool is suspended then fail immediately. Otherwise,
|
|
|
|
* check to see if the pool is making any progress. If
|
|
|
|
* vdev_deadman() discovers that there hasn't been any recent
|
|
|
|
* I/Os then it will end up aborting the tests.
|
|
|
|
*/
|
|
|
|
if (spa_suspended(spa) || spa->spa_root_vdev == NULL) {
|
|
|
|
fatal(0, "aborting test after %llu seconds because "
|
|
|
|
"pool has transitioned to a suspended state.",
|
|
|
|
zfs_deadman_synctime_ms / 1000);
|
|
|
|
}
|
|
|
|
vdev_deadman(spa->spa_root_vdev, FTAG);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the process doesn't complete within a grace period of
|
|
|
|
* zfs_deadman_synctime_ms over the expected finish time,
|
|
|
|
* then it may be hung and is terminated.
|
|
|
|
*/
|
|
|
|
overdue = zs->zs_proc_stop + MSEC2NSEC(zfs_deadman_synctime_ms);
|
|
|
|
if (gethrtime() > overdue) {
|
|
|
|
fatal(0, "aborting test after %llu seconds because "
|
2018-10-10 23:48:33 +03:00
|
|
|
"the process is overdue for termination.",
|
|
|
|
(gethrtime() - zs->zs_proc_start) / NANOSEC);
|
2017-12-19 01:06:07 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("ztest has been running for %lld seconds\n",
|
2018-10-10 23:48:33 +03:00
|
|
|
(gethrtime() - zs->zs_proc_start) / NANOSEC);
|
|
|
|
|
|
|
|
last_run = gethrtime();
|
|
|
|
delay = MSEC2NSEC(zfs_deadman_checktime_ms);
|
2017-12-19 01:06:07 +03:00
|
|
|
}
|
2018-10-10 23:48:33 +03:00
|
|
|
|
|
|
|
thread_exit();
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_execute(int test, ztest_info_t *zi, uint64_t id)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_ds_t *zd = &ztest_ds[id % ztest_opts.zo_datasets];
|
|
|
|
ztest_shared_callstate_t *zc = ZTEST_GET_SHARED_CALLSTATE(test);
|
2010-05-29 00:45:14 +04:00
|
|
|
hrtime_t functime = gethrtime();
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < zi->zi_iters; i++)
|
2010-05-29 00:45:14 +04:00
|
|
|
zi->zi_func(zd, id);
|
|
|
|
|
|
|
|
functime = gethrtime() - functime;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
atomic_add_64(&zc->zc_count, 1);
|
|
|
|
atomic_add_64(&zc->zc_time, functime);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-02-24 21:53:31 +03:00
|
|
|
if (ztest_opts.zo_verbose >= 4)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("%6.2f sec in %s\n",
|
2015-02-24 21:53:31 +03:00
|
|
|
(double)functime / NANOSEC, zi->zi_funcname);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
static void
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_thread(void *arg)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
int rand;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t id = (uintptr_t)arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t call_next;
|
|
|
|
hrtime_t now;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_info_t *zi;
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_callstate_t *zc;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
while ((now = gethrtime()) < zs->zs_thread_stop) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* See if it's time to force a crash.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (now > zs->zs_thread_kill)
|
|
|
|
ztest_kill(zs);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* If we're getting ENOSPC with some regularity, stop.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zs->zs_enospc_count > 10)
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Pick a random function to execute.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2012-01-24 06:43:32 +04:00
|
|
|
rand = ztest_random(ZTEST_FUNCS);
|
|
|
|
zi = &ztest_info[rand];
|
|
|
|
zc = ZTEST_GET_SHARED_CALLSTATE(rand);
|
|
|
|
call_next = zc->zc_next;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (now >= call_next &&
|
2012-01-24 06:43:32 +04:00
|
|
|
atomic_cas_64(&zc->zc_next, call_next, call_next +
|
|
|
|
ztest_random(2 * zi->zi_interval[0] + 1)) == call_next) {
|
|
|
|
ztest_execute(rand, zi, id);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
thread_exit();
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_dataset_name(char *dsname, char *pool, int d)
|
|
|
|
{
|
2016-06-16 00:28:36 +03:00
|
|
|
(void) snprintf(dsname, ZFS_MAX_DATASET_NAME_LEN, "%s/ds_%d", pool, d);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dataset_destroy(int d)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
2010-08-26 20:52:39 +04:00
|
|
|
int t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dataset_name(name, ztest_opts.zo_pool, d);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("Destroying %s to free up space\n", name);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Cleanup any non-standard clones and snapshots. In general,
|
|
|
|
* ztest thread t operates on dataset (t % zopt_datasets),
|
|
|
|
* so there may be more than one thing to clean up.
|
|
|
|
*/
|
2012-01-24 06:43:32 +04:00
|
|
|
for (t = d; t < ztest_opts.zo_threads;
|
|
|
|
t += ztest_opts.zo_datasets)
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dsl_dataset_cleanup(name, t);
|
|
|
|
|
|
|
|
(void) dmu_objset_find(name, ztest_objset_destroy_cb, NULL,
|
|
|
|
DS_FIND_SNAPSHOTS | DS_FIND_CHILDREN);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_dataset_dirobj_verify(ztest_ds_t *zd)
|
|
|
|
{
|
|
|
|
uint64_t usedobjs, dirobjs, scratch;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ZTEST_DIROBJ is the object directory for the entire dataset.
|
|
|
|
* Therefore, the number of objects in use should equal the
|
|
|
|
* number of ZTEST_DIROBJ entries, +1 for ZTEST_DIROBJ itself.
|
|
|
|
* If not, we have an object leak.
|
|
|
|
*
|
|
|
|
* Note that we can only check this in ztest_dataset_open(),
|
|
|
|
* when the open-context and syncing-context values agree.
|
|
|
|
* That's because zap_count() returns the open-context value,
|
|
|
|
* while dmu_objset_space() returns the rootbp fill count.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_count(zd->zd_os, ZTEST_DIROBJ, &dirobjs));
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_space(zd->zd_os, &scratch, &scratch, &usedobjs, &scratch);
|
|
|
|
ASSERT3U(dirobjs + 1, ==, usedobjs);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dataset_open(int d)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_ds_t *zd = &ztest_ds[d];
|
|
|
|
uint64_t committed_seq = ZTEST_GET_SHARED_DS(d)->zd_seq;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os;
|
|
|
|
zilog_t *zilog;
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
2010-05-29 00:45:14 +04:00
|
|
|
int error;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dataset_name(name, ztest_opts.zo_pool, d);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
error = ztest_dataset_create(name);
|
|
|
|
if (error == ENOSPC) {
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
return (error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(error == 0 || error == EEXIST);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY0(ztest_dmu_objset_own(name, DMU_OST_OTHER, B_FALSE,
|
|
|
|
B_TRUE, zd, &os));
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_zd_init(zd, ZTEST_GET_SHARED_DS(d), os);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
zilog = zd->zd_zilog;
|
|
|
|
|
|
|
|
if (zilog->zl_header->zh_claim_lr_seq != 0 &&
|
|
|
|
zilog->zl_header->zh_claim_lr_seq < committed_seq)
|
|
|
|
fatal(0, "missing log records: claimed %llu < committed %llu",
|
|
|
|
zilog->zl_header->zh_claim_lr_seq, committed_seq);
|
|
|
|
|
|
|
|
ztest_dataset_dirobj_verify(zd);
|
|
|
|
|
|
|
|
zil_replay(os, zd, ztest_replay_vector);
|
|
|
|
|
|
|
|
ztest_dataset_dirobj_verify(zd);
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("%s replay %llu blocks, %llu records, seq %llu\n",
|
|
|
|
zd->zd_name,
|
|
|
|
(u_longlong_t)zilog->zl_parse_blk_count,
|
|
|
|
(u_longlong_t)zilog->zl_parse_lr_count,
|
|
|
|
(u_longlong_t)zilog->zl_replaying_seq);
|
|
|
|
|
|
|
|
zilog = zil_open(os, ztest_get_data);
|
|
|
|
|
|
|
|
if (zilog->zl_replaying_seq != 0 &&
|
|
|
|
zilog->zl_replaying_seq < committed_seq)
|
|
|
|
fatal(0, "missing log records: replayed %llu < committed %llu",
|
|
|
|
zilog->zl_replaying_seq, committed_seq);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dataset_close(int d)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_ds_t *zd = &ztest_ds[d];
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
zil_close(zd->zd_zilog);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(zd->zd_os, B_TRUE, zd);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
ztest_zd_fini(zd);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2018-11-08 02:40:24 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
ztest_replay_zil_cb(const char *name, void *arg)
|
|
|
|
{
|
|
|
|
objset_t *os;
|
|
|
|
ztest_ds_t *zdtmp;
|
|
|
|
|
|
|
|
VERIFY0(ztest_dmu_objset_own(name, DMU_OST_ANY, B_TRUE,
|
|
|
|
B_TRUE, FTAG, &os));
|
|
|
|
|
|
|
|
zdtmp = umem_alloc(sizeof (ztest_ds_t), UMEM_NOFAIL);
|
|
|
|
|
|
|
|
ztest_zd_init(zdtmp, NULL, os);
|
|
|
|
zil_replay(os, zdtmp, ztest_replay_vector);
|
|
|
|
ztest_zd_fini(zdtmp);
|
|
|
|
|
|
|
|
if (dmu_objset_zil(os)->zl_parse_lr_count != 0 &&
|
|
|
|
ztest_opts.zo_verbose >= 6) {
|
|
|
|
zilog_t *zilog = dmu_objset_zil(os);
|
|
|
|
|
|
|
|
(void) printf("%s replay %llu blocks, %llu records, seq %llu\n",
|
|
|
|
name,
|
|
|
|
(u_longlong_t)zilog->zl_parse_blk_count,
|
|
|
|
(u_longlong_t)zilog->zl_parse_lr_count,
|
|
|
|
(u_longlong_t)zilog->zl_replaying_seq);
|
|
|
|
}
|
|
|
|
|
|
|
|
umem_free(zdtmp, sizeof (ztest_ds_t));
|
|
|
|
|
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2020-06-06 22:51:35 +03:00
|
|
|
static void
|
|
|
|
ztest_freeze(void)
|
|
|
|
{
|
|
|
|
ztest_ds_t *zd = &ztest_ds[0];
|
|
|
|
spa_t *spa;
|
|
|
|
int numloops = 0;
|
|
|
|
|
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
|
|
|
(void) printf("testing spa_freeze()...\n");
|
|
|
|
|
|
|
|
kernel_init(SPA_MODE_READ | SPA_MODE_WRITE);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
|
|
|
VERIFY0(ztest_dataset_open(0));
|
2020-06-06 22:51:35 +03:00
|
|
|
ztest_spa = spa;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Force the first log block to be transactionally allocated.
|
|
|
|
* We have to do this before we freeze the pool -- otherwise
|
|
|
|
* the log chain won't be anchored.
|
|
|
|
*/
|
|
|
|
while (BP_IS_HOLE(&zd->zd_zilog->zl_header->zh_log)) {
|
|
|
|
ztest_dmu_object_alloc_free(zd, 0);
|
|
|
|
zil_commit(zd->zd_zilog, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Freeze the pool. This stops spa_sync() from doing anything,
|
|
|
|
* so that the only way to record changes from now on is the ZIL.
|
|
|
|
*/
|
|
|
|
spa_freeze(spa);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Because it is hard to predict how much space a write will actually
|
|
|
|
* require beforehand, we leave ourselves some fudge space to write over
|
|
|
|
* capacity.
|
|
|
|
*/
|
|
|
|
uint64_t capacity = metaslab_class_get_space(spa_normal_class(spa)) / 2;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Run tests that generate log records but don't alter the pool config
|
|
|
|
* or depend on DSL sync tasks (snapshots, objset create/destroy, etc).
|
|
|
|
* We do a txg_wait_synced() after each iteration to force the txg
|
|
|
|
* to increase well beyond the last synced value in the uberblock.
|
|
|
|
* The ZIL should be OK with that.
|
|
|
|
*
|
|
|
|
* Run a random number of times less than zo_maxloops and ensure we do
|
|
|
|
* not run out of space on the pool.
|
|
|
|
*/
|
|
|
|
while (ztest_random(10) != 0 &&
|
|
|
|
numloops++ < ztest_opts.zo_maxloops &&
|
|
|
|
metaslab_class_get_alloc(spa_normal_class(spa)) < capacity) {
|
|
|
|
ztest_od_t od;
|
|
|
|
ztest_od_init(&od, 0, FTAG, 0, DMU_OT_UINT64_OTHER, 0, 0, 0);
|
|
|
|
VERIFY0(ztest_object_init(zd, &od, sizeof (od), B_FALSE));
|
|
|
|
ztest_io(zd, od.od_object,
|
|
|
|
ztest_random(ZTEST_RANGE_LOCKS) << SPA_MAXBLOCKSHIFT);
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commit all of the changes we just generated.
|
|
|
|
*/
|
|
|
|
zil_commit(zd->zd_zilog, 0);
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Close our dataset and close the pool.
|
|
|
|
*/
|
|
|
|
ztest_dataset_close(0);
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
kernel_fini();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Open and close the pool and dataset to induce log replay.
|
|
|
|
*/
|
|
|
|
kernel_init(SPA_MODE_READ | SPA_MODE_WRITE);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
|
|
|
ASSERT3U(spa_freeze_txg(spa), ==, UINT64_MAX);
|
|
|
|
VERIFY0(ztest_dataset_open(0));
|
2020-06-06 22:51:35 +03:00
|
|
|
ztest_spa = spa;
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
ztest_dataset_close(0);
|
|
|
|
ztest_reguid(NULL, 0);
|
|
|
|
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
kernel_fini();
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_import_impl(ztest_shared_t *zs)
|
|
|
|
{
|
|
|
|
importargs_t args = { 0 };
|
|
|
|
nvlist_t *cfg = NULL;
|
|
|
|
int nsearch = 1;
|
|
|
|
char *searchdirs[nsearch];
|
|
|
|
int flags = ZFS_IMPORT_MISSING_LOG;
|
|
|
|
|
|
|
|
searchdirs[0] = ztest_opts.zo_dir;
|
|
|
|
args.paths = nsearch;
|
|
|
|
args.path = searchdirs;
|
|
|
|
args.can_be_active = B_FALSE;
|
|
|
|
|
|
|
|
VERIFY0(zpool_find_config(NULL, ztest_opts.zo_pool, &cfg, &args,
|
|
|
|
&libzpool_config_ops));
|
|
|
|
VERIFY0(spa_import(ztest_opts.zo_pool, cfg, NULL, flags));
|
2020-12-23 20:52:24 +03:00
|
|
|
fnvlist_free(cfg);
|
2020-06-06 22:51:35 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Import a storage pool with the given name.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ztest_import(ztest_shared_t *zs)
|
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
|
|
|
|
mutex_init(&ztest_vdev_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_init(&ztest_checkpoint_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
VERIFY0(pthread_rwlock_init(&ztest_name_lock, NULL));
|
|
|
|
|
|
|
|
kernel_init(SPA_MODE_READ | SPA_MODE_WRITE);
|
|
|
|
|
|
|
|
ztest_import_impl(zs);
|
|
|
|
|
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
|
|
|
zs->zs_metaslab_sz =
|
|
|
|
1ULL << spa->spa_root_vdev->vdev_child[0]->vdev_ms_shift;
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
kernel_fini();
|
|
|
|
|
|
|
|
if (!ztest_opts.zo_mmp_test) {
|
|
|
|
ztest_run_zdb(ztest_opts.zo_pool);
|
|
|
|
ztest_freeze();
|
|
|
|
ztest_run_zdb(ztest_opts.zo_pool);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) pthread_rwlock_destroy(&ztest_name_lock);
|
|
|
|
mutex_destroy(&ztest_vdev_lock);
|
|
|
|
mutex_destroy(&ztest_checkpoint_lock);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Kick off threads to run tests on all datasets in parallel.
|
|
|
|
*/
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_run(ztest_shared_t *zs)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
spa_t *spa;
|
2011-11-12 02:07:54 +04:00
|
|
|
objset_t *os;
|
2018-10-10 23:48:33 +03:00
|
|
|
kthread_t *resume_thread, *deadman_thread;
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
kthread_t **run_threads;
|
2010-08-26 21:43:27 +04:00
|
|
|
uint64_t object;
|
2010-05-29 00:45:14 +04:00
|
|
|
int error;
|
2010-08-26 20:52:39 +04:00
|
|
|
int t, d;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
ztest_exiting = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Initialize parent/child shared state.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_init(&ztest_vdev_lock, NULL, MUTEX_DEFAULT, NULL);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_init(&ztest_checkpoint_lock, NULL, MUTEX_DEFAULT, NULL);
|
2018-06-05 02:52:10 +03:00
|
|
|
VERIFY0(pthread_rwlock_init(&ztest_name_lock, NULL));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_thread_start = gethrtime();
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_thread_stop =
|
|
|
|
zs->zs_thread_start + ztest_opts.zo_passtime * NANOSEC;
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_thread_stop = MIN(zs->zs_thread_stop, zs->zs_proc_stop);
|
|
|
|
zs->zs_thread_kill = zs->zs_thread_stop;
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_random(100) < ztest_opts.zo_killrate) {
|
|
|
|
zs->zs_thread_kill -=
|
|
|
|
ztest_random(ztest_opts.zo_passtime * NANOSEC);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_init(&zcl.zcl_callbacks_lock, NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
list_create(&zcl.zcl_callbacks, sizeof (ztest_cb_data_t),
|
|
|
|
offsetof(ztest_cb_data_t, zcd_node));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2020-06-06 22:51:35 +03:00
|
|
|
* Open our pool. It may need to be imported first depending on
|
|
|
|
* what tests were running when the previous pass was terminated.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2019-11-21 20:32:57 +03:00
|
|
|
kernel_init(SPA_MODE_READ | SPA_MODE_WRITE);
|
2020-06-06 22:51:35 +03:00
|
|
|
error = spa_open(ztest_opts.zo_pool, &spa, FTAG);
|
|
|
|
if (error) {
|
|
|
|
VERIFY3S(error, ==, ENOENT);
|
|
|
|
ztest_import_impl(zs);
|
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
|
|
|
zs->zs_metaslab_sz =
|
|
|
|
1ULL << spa->spa_root_vdev->vdev_child[0]->vdev_ms_shift;
|
|
|
|
}
|
|
|
|
|
2014-06-13 03:29:11 +04:00
|
|
|
metaslab_preload_limit = ztest_random(20) + 1;
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_spa = spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2019-07-12 19:31:20 +03:00
|
|
|
VERIFY0(vdev_raidz_impl_set("cycle"));
|
|
|
|
|
2017-01-26 23:30:43 +03:00
|
|
|
dmu_objset_stats_t dds;
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY0(ztest_dmu_objset_own(ztest_opts.zo_pool,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
DMU_OST_ANY, B_TRUE, B_TRUE, FTAG, &os));
|
2017-01-26 23:30:43 +03:00
|
|
|
dsl_pool_config_enter(dmu_objset_pool(os), FTAG);
|
|
|
|
dmu_objset_fast_stat(os, &dds);
|
|
|
|
dsl_pool_config_exit(dmu_objset_pool(os), FTAG);
|
|
|
|
zs->zs_guid = dds.dds_guid;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2011-11-12 02:07:54 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Create a thread to periodically resume suspended I/O.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
resume_thread = thread_create(NULL, 0, ztest_resume_thread,
|
|
|
|
spa, 0, NULL, TS_RUN | TS_JOINABLE, defclsyspri);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2017-12-19 01:06:07 +03:00
|
|
|
* Create a deadman thread and set to panic if we hang.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2018-10-10 23:48:33 +03:00
|
|
|
deadman_thread = thread_create(NULL, 0, ztest_deadman_thread,
|
2017-12-19 01:06:07 +03:00
|
|
|
zs, 0, NULL, TS_RUN | TS_JOINABLE, defclsyspri);
|
|
|
|
|
|
|
|
spa->spa_deadman_failmode = ZIO_FAILURE_MODE_PANIC;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2016-12-17 01:11:29 +03:00
|
|
|
* Verify that we can safely inquire about any object,
|
2008-11-20 23:01:55 +03:00
|
|
|
* whether it's allocated or not. To make it interesting,
|
|
|
|
* we probe a 5-wide window around each power of two.
|
|
|
|
* This hits all edge cases, including zero and the max.
|
|
|
|
*/
|
2010-08-26 20:52:39 +04:00
|
|
|
for (t = 0; t < 64; t++) {
|
|
|
|
for (d = -5; d <= 5; d++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
error = dmu_object_info(spa->spa_meta_objset,
|
|
|
|
(1ULL << t) + d, NULL);
|
|
|
|
ASSERT(error == 0 || error == ENOENT ||
|
|
|
|
error == EINVAL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* If we got any ENOSPC errors on the previous run, destroy something.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zs->zs_enospc_count != 0) {
|
2012-01-24 06:43:32 +04:00
|
|
|
int d = ztest_random(ztest_opts.zo_datasets);
|
|
|
|
ztest_dataset_destroy(d);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
zs->zs_enospc_count = 0;
|
|
|
|
|
2018-11-29 07:47:09 +03:00
|
|
|
/*
|
|
|
|
* If we were in the middle of ztest_device_removal() and were killed
|
|
|
|
* we need to ensure the removal and scrub complete before running
|
|
|
|
* any tests that check ztest_device_removal_active. The removal will
|
|
|
|
* be restarted automatically when the spa is opened, but we need to
|
2019-01-13 21:11:52 +03:00
|
|
|
* initiate the scrub manually if it is not already in progress. Note
|
2018-11-29 07:47:09 +03:00
|
|
|
* that we always run the scrub whenever an indirect vdev exists
|
|
|
|
* because we have no way of knowing for sure if ztest_device_removal()
|
|
|
|
* fully completed its scrub before the pool was reimported.
|
|
|
|
*/
|
|
|
|
if (spa->spa_removing_phys.sr_state == DSS_SCANNING ||
|
|
|
|
spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
|
|
|
|
while (spa->spa_removing_phys.sr_state == DSS_SCANNING)
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
2019-01-18 20:47:55 +03:00
|
|
|
error = ztest_scrub_impl(spa);
|
|
|
|
if (error == EBUSY)
|
|
|
|
error = 0;
|
|
|
|
ASSERT0(error);
|
2018-11-29 07:47:09 +03:00
|
|
|
}
|
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
run_threads = umem_zalloc(ztest_opts.zo_threads * sizeof (kthread_t *),
|
2012-01-24 06:43:32 +04:00
|
|
|
UMEM_NOFAIL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 4)
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("starting main threads...\n");
|
|
|
|
|
2018-11-08 02:40:24 +03:00
|
|
|
/*
|
|
|
|
* Replay all logs of all datasets in the pool. This is primarily for
|
|
|
|
* temporary datasets which wouldn't otherwise get replayed, which
|
|
|
|
* can trigger failures when attempting to offline a SLOG in
|
|
|
|
* ztest_fault_inject().
|
|
|
|
*/
|
|
|
|
(void) dmu_objset_find(ztest_opts.zo_pool, ztest_replay_zil_cb,
|
|
|
|
NULL, DS_FIND_CHILDREN);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Kick off all the tests that run in parallel.
|
|
|
|
*/
|
2012-01-24 06:43:32 +04:00
|
|
|
for (t = 0; t < ztest_opts.zo_threads; t++) {
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
if (t < ztest_opts.zo_datasets && ztest_dataset_open(t) != 0) {
|
|
|
|
umem_free(run_threads, ztest_opts.zo_threads *
|
|
|
|
sizeof (kthread_t *));
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2016-10-12 21:16:47 +03:00
|
|
|
}
|
2010-08-26 21:43:27 +04:00
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
run_threads[t] = thread_create(NULL, 0, ztest_thread,
|
|
|
|
(void *)(uintptr_t)t, 0, NULL, TS_RUN | TS_JOINABLE,
|
|
|
|
defclsyspri);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2018-11-08 02:46:50 +03:00
|
|
|
* Wait for all of the tests to complete.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2018-11-08 02:46:50 +03:00
|
|
|
for (t = 0; t < ztest_opts.zo_threads; t++)
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
VERIFY0(thread_join(run_threads[t]));
|
2018-11-08 02:46:50 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Close all datasets. This must be done after all the threads
|
|
|
|
* are joined so we can be sure none of the datasets are in-use
|
|
|
|
* by any of the threads.
|
|
|
|
*/
|
|
|
|
for (t = 0; t < ztest_opts.zo_threads; t++) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (t < ztest_opts.zo_datasets)
|
|
|
|
ztest_dataset_close(t);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_alloc = metaslab_class_get_alloc(spa_normal_class(spa));
|
|
|
|
zs->zs_space = metaslab_class_get_space(spa_normal_class(spa));
|
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
umem_free(run_threads, ztest_opts.zo_threads * sizeof (kthread_t *));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-10-10 23:48:33 +03:00
|
|
|
/* Kill the resume and deadman threads */
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_exiting = B_TRUE;
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
VERIFY0(thread_join(resume_thread));
|
2018-10-10 23:48:33 +03:00
|
|
|
VERIFY0(thread_join(deadman_thread));
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_resume(spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Right before closing the pool, kick off a bunch of async I/O;
|
|
|
|
* spa_close() should wait for it to complete.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-12-22 04:31:57 +03:00
|
|
|
for (object = 1; object < 50; object++) {
|
|
|
|
dmu_prefetch(spa->spa_meta_objset, object, 0, 0, 1ULL << 20,
|
|
|
|
ZIO_PRIORITY_SYNC_READ);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 21:17:18 +04:00
|
|
|
/* Verify that at least one commit cb was called in a timely fashion */
|
|
|
|
if (zc_cb_counter >= ZTEST_COMMIT_CB_MIN_REG)
|
2013-05-11 01:17:03 +04:00
|
|
|
VERIFY0(zc_min_txg_delay);
|
2010-08-26 21:17:18 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can loop over all pools.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
for (spa = spa_next(NULL); spa != NULL; spa = spa_next(spa))
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose > 3)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("spa_next: found %s\n", spa_name(spa));
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can export the pool and reimport it under a
|
|
|
|
* different name.
|
|
|
|
*/
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if ((ztest_random(2) == 0) && !ztest_opts.zo_mmp_test) {
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
|
|
|
(void) snprintf(name, sizeof (name), "%s_import",
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_pool);
|
|
|
|
ztest_spa_import_export(ztest_opts.zo_pool, name);
|
|
|
|
ztest_spa_import_export(name, ztest_opts.zo_pool);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
kernel_fini();
|
2010-08-27 01:24:34 +04:00
|
|
|
|
|
|
|
list_destroy(&zcl.zcl_callbacks);
|
2010-08-26 22:59:11 +04:00
|
|
|
mutex_destroy(&zcl.zcl_callbacks_lock);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_destroy(&ztest_name_lock);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_destroy(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_destroy(&ztest_checkpoint_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2008-11-20 23:01:55 +03:00
|
|
|
print_time(hrtime_t t, char *timebuf)
|
|
|
|
{
|
|
|
|
hrtime_t s = t / NANOSEC;
|
|
|
|
hrtime_t m = s / 60;
|
|
|
|
hrtime_t h = m / 60;
|
|
|
|
hrtime_t d = h / 24;
|
|
|
|
|
|
|
|
s -= m * 60;
|
|
|
|
m -= h * 60;
|
|
|
|
h -= d * 24;
|
|
|
|
|
|
|
|
timebuf[0] = '\0';
|
|
|
|
|
|
|
|
if (d)
|
|
|
|
(void) sprintf(timebuf,
|
|
|
|
"%llud%02lluh%02llum%02llus", d, h, m, s);
|
|
|
|
else if (h)
|
|
|
|
(void) sprintf(timebuf, "%lluh%02llum%02llus", h, m, s);
|
|
|
|
else if (m)
|
|
|
|
(void) sprintf(timebuf, "%llum%02llus", m, s);
|
|
|
|
else
|
|
|
|
(void) sprintf(timebuf, "%llus", s);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static nvlist_t *
|
2010-08-26 20:52:41 +04:00
|
|
|
make_random_props(void)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
nvlist_t *props;
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
props = fnvlist_alloc();
|
2018-09-06 04:33:36 +03:00
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (ztest_random(2) == 0)
|
|
|
|
return (props);
|
2018-02-05 23:00:26 +03:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_AUTOREPLACE), 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
return (props);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Create a storage pool with the given name and initial vdev size.
|
2010-05-29 00:45:14 +04:00
|
|
|
* Then test spa_freeze() functionality.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_init(ztest_shared_t *zs)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
spa_t *spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *nvroot, *props;
|
2012-12-14 03:24:15 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_init(&ztest_vdev_lock, NULL, MUTEX_DEFAULT, NULL);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_init(&ztest_checkpoint_lock, NULL, MUTEX_DEFAULT, NULL);
|
2018-06-05 02:52:10 +03:00
|
|
|
VERIFY0(pthread_rwlock_init(&ztest_name_lock, NULL));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-11-21 20:32:57 +03:00
|
|
|
kernel_init(SPA_MODE_READ | SPA_MODE_WRITE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Create the storage pool.
|
|
|
|
*/
|
2012-01-24 06:43:32 +04:00
|
|
|
(void) spa_destroy(ztest_opts.zo_pool);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared->zs_vdev_next_leaf = 0;
|
|
|
|
zs->zs_splits = 0;
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_mirrors = ztest_opts.zo_mirrors;
|
2012-12-15 04:28:49 +04:00
|
|
|
nvroot = make_vdev_root(NULL, NULL, NULL, ztest_opts.zo_vdev_size, 0,
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
NULL, ztest_opts.zo_raid_children, zs->zs_mirrors, 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
props = make_random_props();
|
2018-02-05 23:00:26 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't expect the pool to suspend unless maxfaults == 0,
|
|
|
|
* in which case ztest_fault_inject() temporarily takes away
|
|
|
|
* the only valid replica.
|
|
|
|
*/
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_add_uint64(props,
|
2018-02-05 23:00:26 +03:00
|
|
|
zpool_prop_to_name(ZPOOL_PROP_FAILUREMODE),
|
2021-01-11 20:30:33 +03:00
|
|
|
MAXFAULTS(zs) ? ZIO_FAILURE_MODE_PANIC : ZIO_FAILURE_MODE_WAIT);
|
2018-02-05 23:00:26 +03:00
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
for (i = 0; i < SPA_FEATURES; i++) {
|
|
|
|
char *buf;
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
|
2021-02-28 04:16:02 +03:00
|
|
|
if (!spa_feature_table[i].fi_zfs_mod_supported)
|
|
|
|
continue;
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
/*
|
|
|
|
* 75% chance of using the log space map feature. We want ztest
|
|
|
|
* to exercise both the code paths that use the log space map
|
|
|
|
* feature and the ones that don't.
|
|
|
|
*/
|
|
|
|
if (i == SPA_FEATURE_LOG_SPACEMAP && ztest_random(4) == 0)
|
|
|
|
continue;
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
VERIFY3S(-1, !=, asprintf(&buf, "feature@%s",
|
|
|
|
spa_feature_table[i].fi_uname));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_add_uint64(props, buf, 0);
|
2012-12-14 03:24:15 +04:00
|
|
|
free(buf);
|
|
|
|
}
|
2018-02-05 23:00:26 +03:00
|
|
|
|
|
|
|
VERIFY0(spa_create(ztest_opts.zo_pool, nvroot, props, NULL, NULL));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
|
|
|
fnvlist_free(props);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_metaslab_sz =
|
|
|
|
1ULL << spa->spa_root_vdev->vdev_child[0]->vdev_ms_shift;
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
kernel_fini();
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (!ztest_opts.zo_mmp_test) {
|
|
|
|
ztest_run_zdb(ztest_opts.zo_pool);
|
|
|
|
ztest_freeze();
|
|
|
|
ztest_run_zdb(ztest_opts.zo_pool);
|
|
|
|
}
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_destroy(&ztest_name_lock);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_destroy(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_destroy(&ztest_checkpoint_lock);
|
2012-01-24 06:43:32 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2012-10-04 18:09:12 +04:00
|
|
|
setup_data_fd(void)
|
2012-01-24 06:43:32 +04:00
|
|
|
{
|
2012-10-04 22:36:52 +04:00
|
|
|
static char ztest_name_data[] = "/tmp/ztest.data.XXXXXX";
|
|
|
|
|
|
|
|
ztest_fd_data = mkstemp(ztest_name_data);
|
2012-10-04 18:09:12 +04:00
|
|
|
ASSERT3S(ztest_fd_data, >=, 0);
|
2012-10-04 22:36:52 +04:00
|
|
|
(void) unlink(ztest_name_data);
|
2012-01-24 06:43:32 +04:00
|
|
|
}
|
|
|
|
|
2012-05-21 23:11:39 +04:00
|
|
|
static int
|
|
|
|
shared_data_size(ztest_shared_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
int size;
|
|
|
|
|
|
|
|
size = hdr->zh_hdr_size;
|
|
|
|
size += hdr->zh_opts_size;
|
|
|
|
size += hdr->zh_size;
|
|
|
|
size += hdr->zh_stats_size * hdr->zh_stats_count;
|
|
|
|
size += hdr->zh_ds_size * hdr->zh_ds_count;
|
|
|
|
|
|
|
|
return (size);
|
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
static void
|
|
|
|
setup_hdr(void)
|
|
|
|
{
|
2012-05-21 23:11:39 +04:00
|
|
|
int size;
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_hdr_t *hdr;
|
|
|
|
|
|
|
|
hdr = (void *)mmap(0, P2ROUNDUP(sizeof (*hdr), getpagesize()),
|
2012-10-04 18:09:12 +04:00
|
|
|
PROT_READ | PROT_WRITE, MAP_SHARED, ztest_fd_data, 0);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(hdr, !=, MAP_FAILED);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(ftruncate(ztest_fd_data, sizeof (ztest_shared_hdr_t)));
|
2012-05-21 23:11:39 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
hdr->zh_hdr_size = sizeof (ztest_shared_hdr_t);
|
|
|
|
hdr->zh_opts_size = sizeof (ztest_shared_opts_t);
|
|
|
|
hdr->zh_size = sizeof (ztest_shared_t);
|
|
|
|
hdr->zh_stats_size = sizeof (ztest_shared_callstate_t);
|
|
|
|
hdr->zh_stats_count = ZTEST_FUNCS;
|
|
|
|
hdr->zh_ds_size = sizeof (ztest_shared_ds_t);
|
|
|
|
hdr->zh_ds_count = ztest_opts.zo_datasets;
|
|
|
|
|
2012-05-21 23:11:39 +04:00
|
|
|
size = shared_data_size(hdr);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(ftruncate(ztest_fd_data, size));
|
2012-05-21 23:11:39 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
(void) munmap((caddr_t)hdr, P2ROUNDUP(sizeof (*hdr), getpagesize()));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
setup_data(void)
|
|
|
|
{
|
|
|
|
int size, offset;
|
|
|
|
ztest_shared_hdr_t *hdr;
|
|
|
|
uint8_t *buf;
|
|
|
|
|
|
|
|
hdr = (void *)mmap(0, P2ROUNDUP(sizeof (*hdr), getpagesize()),
|
2012-10-04 18:09:12 +04:00
|
|
|
PROT_READ, MAP_SHARED, ztest_fd_data, 0);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(hdr, !=, MAP_FAILED);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2012-05-21 23:11:39 +04:00
|
|
|
size = shared_data_size(hdr);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
(void) munmap((caddr_t)hdr, P2ROUNDUP(sizeof (*hdr), getpagesize()));
|
|
|
|
hdr = ztest_shared_hdr = (void *)mmap(0, P2ROUNDUP(size, getpagesize()),
|
2012-10-04 18:09:12 +04:00
|
|
|
PROT_READ | PROT_WRITE, MAP_SHARED, ztest_fd_data, 0);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(hdr, !=, MAP_FAILED);
|
2012-01-24 06:43:32 +04:00
|
|
|
buf = (uint8_t *)hdr;
|
|
|
|
|
|
|
|
offset = hdr->zh_hdr_size;
|
|
|
|
ztest_shared_opts = (void *)&buf[offset];
|
|
|
|
offset += hdr->zh_opts_size;
|
|
|
|
ztest_shared = (void *)&buf[offset];
|
|
|
|
offset += hdr->zh_size;
|
|
|
|
ztest_shared_callstate = (void *)&buf[offset];
|
|
|
|
offset += hdr->zh_stats_size * hdr->zh_stats_count;
|
|
|
|
ztest_shared_ds = (void *)&buf[offset];
|
|
|
|
}
|
|
|
|
|
|
|
|
static boolean_t
|
|
|
|
exec_child(char *cmd, char *libpath, boolean_t ignorekill, int *statusp)
|
|
|
|
{
|
|
|
|
pid_t pid;
|
|
|
|
int status;
|
2012-10-04 22:14:04 +04:00
|
|
|
char *cmdbuf = NULL;
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
pid = fork();
|
|
|
|
|
|
|
|
if (cmd == NULL) {
|
2012-10-04 22:14:04 +04:00
|
|
|
cmdbuf = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
(void) strlcpy(cmdbuf, getexecname(), MAXPATHLEN);
|
2012-01-24 06:43:32 +04:00
|
|
|
cmd = cmdbuf;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (pid == -1)
|
|
|
|
fatal(1, "fork failed");
|
|
|
|
|
|
|
|
if (pid == 0) { /* child */
|
|
|
|
char *emptyargv[2] = { cmd, NULL };
|
2012-10-04 18:09:12 +04:00
|
|
|
char fd_data_str[12];
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
struct rlimit rl = { 1024, 1024 };
|
|
|
|
(void) setrlimit(RLIMIT_NOFILE, &rl);
|
2012-10-04 18:09:12 +04:00
|
|
|
|
|
|
|
(void) close(ztest_fd_rand);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3S(11, >=,
|
|
|
|
snprintf(fd_data_str, 12, "%d", ztest_fd_data));
|
|
|
|
VERIFY0(setenv("ZTEST_FD_DATA", fd_data_str, 1));
|
2012-10-04 18:09:12 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
(void) enable_extended_FILE_stdio(-1, -1);
|
|
|
|
if (libpath != NULL)
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(setenv("LD_LIBRARY_PATH", libpath, 1));
|
2012-01-24 06:43:32 +04:00
|
|
|
(void) execv(cmd, emptyargv);
|
|
|
|
ztest_dump_core = B_FALSE;
|
|
|
|
fatal(B_TRUE, "exec failed: %s", cmd);
|
|
|
|
}
|
|
|
|
|
2012-10-04 22:14:04 +04:00
|
|
|
if (cmdbuf != NULL) {
|
|
|
|
umem_free(cmdbuf, MAXPATHLEN);
|
|
|
|
cmd = NULL;
|
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
while (waitpid(pid, &status, 0) != pid)
|
|
|
|
continue;
|
|
|
|
if (statusp != NULL)
|
|
|
|
*statusp = status;
|
|
|
|
|
|
|
|
if (WIFEXITED(status)) {
|
|
|
|
if (WEXITSTATUS(status) != 0) {
|
|
|
|
(void) fprintf(stderr, "child exited with code %d\n",
|
|
|
|
WEXITSTATUS(status));
|
|
|
|
exit(2);
|
|
|
|
}
|
|
|
|
return (B_FALSE);
|
|
|
|
} else if (WIFSIGNALED(status)) {
|
|
|
|
if (!ignorekill || WTERMSIG(status) != SIGKILL) {
|
|
|
|
(void) fprintf(stderr, "child died with signal %d\n",
|
|
|
|
WTERMSIG(status));
|
|
|
|
exit(3);
|
|
|
|
}
|
|
|
|
return (B_TRUE);
|
|
|
|
} else {
|
|
|
|
(void) fprintf(stderr, "something strange happened to child\n");
|
|
|
|
exit(4);
|
|
|
|
/* NOTREACHED */
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_run_init(void)
|
|
|
|
{
|
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
/*
|
|
|
|
* Blow away any existing copy of zpool.cache
|
|
|
|
*/
|
|
|
|
(void) remove(spa_config_path);
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_init == 0) {
|
|
|
|
if (ztest_opts.zo_verbose >= 1)
|
|
|
|
(void) printf("Importing pool %s\n",
|
|
|
|
ztest_opts.zo_pool);
|
|
|
|
ztest_import(zs);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
/*
|
|
|
|
* Create and initialize our storage pool.
|
|
|
|
*/
|
|
|
|
for (i = 1; i <= ztest_opts.zo_init; i++) {
|
|
|
|
bzero(zs, sizeof (ztest_shared_t));
|
|
|
|
if (ztest_opts.zo_verbose >= 3 &&
|
|
|
|
ztest_opts.zo_init != 1) {
|
|
|
|
(void) printf("ztest_init(), pass %d\n", i);
|
|
|
|
}
|
|
|
|
ztest_init(zs);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
main(int argc, char **argv)
|
|
|
|
{
|
|
|
|
int kills = 0;
|
|
|
|
int iters = 0;
|
2012-01-24 06:43:32 +04:00
|
|
|
int older = 0;
|
|
|
|
int newer = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_shared_t *zs;
|
|
|
|
ztest_info_t *zi;
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_callstate_t *zc;
|
2008-11-20 23:01:55 +03:00
|
|
|
char timebuf[100];
|
2017-06-13 12:16:45 +03:00
|
|
|
char numbuf[NN_NUMBUF_SZ];
|
2012-10-04 22:14:04 +04:00
|
|
|
char *cmd;
|
2012-01-24 06:43:32 +04:00
|
|
|
boolean_t hasalt;
|
2021-02-16 13:14:44 +03:00
|
|
|
int f, err;
|
2012-10-04 18:09:12 +04:00
|
|
|
char *fd_data_str = getenv("ZTEST_FD_DATA");
|
2014-10-11 05:05:54 +04:00
|
|
|
struct sigaction action;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) setvbuf(stdout, NULL, _IOLBF, 0);
|
|
|
|
|
2013-05-17 01:18:06 +04:00
|
|
|
dprintf_setup(&argc, argv);
|
2017-12-19 01:06:07 +03:00
|
|
|
zfs_deadman_synctime_ms = 300000;
|
2018-10-10 23:48:33 +03:00
|
|
|
zfs_deadman_checktime_ms = 30000;
|
2017-08-04 19:30:49 +03:00
|
|
|
/*
|
|
|
|
* As two-word space map entries may not come up often (especially
|
|
|
|
* if pool and vdev sizes are small) we want to force at least some
|
|
|
|
* of them so the feature get tested.
|
|
|
|
*/
|
|
|
|
zfs_force_some_double_word_sm_entries = B_TRUE;
|
2013-05-17 01:18:06 +04:00
|
|
|
|
2018-10-01 20:36:34 +03:00
|
|
|
/*
|
|
|
|
* Verify that even extensively damaged split blocks with many
|
|
|
|
* segments can be reconstructed in a reasonable amount of time
|
|
|
|
* when reconstruction is known to be possible.
|
2018-10-18 23:53:27 +03:00
|
|
|
*
|
|
|
|
* Note: the lower this value is, the more damage we inflict, and
|
|
|
|
* the more time ztest spends in recovering that damage. We chose
|
|
|
|
* to induce damage 1/100th of the time so recovery is tested but
|
|
|
|
* not so frequently that ztest doesn't get to test other code paths.
|
2018-10-01 20:36:34 +03:00
|
|
|
*/
|
2018-10-18 23:53:27 +03:00
|
|
|
zfs_reconstruct_indirect_damage_fraction = 100;
|
2018-10-01 20:36:34 +03:00
|
|
|
|
2014-10-11 05:05:54 +04:00
|
|
|
action.sa_handler = sig_handler;
|
|
|
|
sigemptyset(&action.sa_mask);
|
|
|
|
action.sa_flags = 0;
|
|
|
|
|
|
|
|
if (sigaction(SIGSEGV, &action, NULL) < 0) {
|
|
|
|
(void) fprintf(stderr, "ztest: cannot catch SIGSEGV: %s.\n",
|
|
|
|
strerror(errno));
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (sigaction(SIGABRT, &action, NULL) < 0) {
|
|
|
|
(void) fprintf(stderr, "ztest: cannot catch SIGABRT: %s.\n",
|
|
|
|
strerror(errno));
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
|
|
|
|
2018-01-12 20:36:26 +03:00
|
|
|
/*
|
|
|
|
* Force random_get_bytes() to use /dev/urandom in order to prevent
|
|
|
|
* ztest from needlessly depleting the system entropy pool.
|
|
|
|
*/
|
|
|
|
random_path = "/dev/urandom";
|
|
|
|
ztest_fd_rand = open(random_path, O_RDONLY);
|
2012-10-04 18:09:12 +04:00
|
|
|
ASSERT3S(ztest_fd_rand, >=, 0);
|
|
|
|
|
|
|
|
if (!fd_data_str) {
|
2012-01-24 06:43:32 +04:00
|
|
|
process_options(argc, argv);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-10-04 18:09:12 +04:00
|
|
|
setup_data_fd();
|
2012-01-24 06:43:32 +04:00
|
|
|
setup_hdr();
|
|
|
|
setup_data();
|
|
|
|
bcopy(&ztest_opts, ztest_shared_opts,
|
|
|
|
sizeof (*ztest_shared_opts));
|
|
|
|
} else {
|
2012-10-04 18:09:12 +04:00
|
|
|
ztest_fd_data = atoi(fd_data_str);
|
2012-01-24 06:43:32 +04:00
|
|
|
setup_data();
|
|
|
|
bcopy(ztest_shared_opts, &ztest_opts, sizeof (ztest_opts));
|
|
|
|
}
|
|
|
|
ASSERT3U(ztest_opts.zo_datasets, ==, ztest_shared_hdr->zh_ds_count);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-02-16 13:14:44 +03:00
|
|
|
err = ztest_set_global_vars();
|
|
|
|
if (err != 0 && !fd_data_str) {
|
|
|
|
/* error message done by ztest_set_global_vars */
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
} else {
|
|
|
|
/* children should not be spawned if setting gvars fails */
|
|
|
|
VERIFY3S(err, ==, 0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* Override location of zpool.cache */
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3S(asprintf((char **)&spa_config_path, "%s/zpool.cache",
|
|
|
|
ztest_opts.zo_dir), !=, -1);
|
2012-10-04 18:09:12 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_ds = umem_alloc(ztest_opts.zo_datasets * sizeof (ztest_ds_t),
|
|
|
|
UMEM_NOFAIL);
|
|
|
|
zs = ztest_shared;
|
|
|
|
|
2012-10-04 18:09:12 +04:00
|
|
|
if (fd_data_str) {
|
2017-07-24 21:07:39 +03:00
|
|
|
metaslab_force_ganging = ztest_opts.zo_metaslab_force_ganging;
|
2012-01-24 06:43:32 +04:00
|
|
|
metaslab_df_alloc_threshold =
|
|
|
|
zs->zs_metaslab_df_alloc_threshold;
|
|
|
|
|
|
|
|
if (zs->zs_do_init)
|
|
|
|
ztest_run_init();
|
|
|
|
else
|
|
|
|
ztest_run(zs);
|
|
|
|
exit(0);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
hasalt = (strlen(ztest_opts.zo_alt_ztest) != 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("%llu vdevs, %d datasets, %d threads,"
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
"%d %s disks, %llu seconds...\n\n",
|
2012-01-24 06:43:32 +04:00
|
|
|
(u_longlong_t)ztest_opts.zo_vdevs,
|
|
|
|
ztest_opts.zo_datasets,
|
|
|
|
ztest_opts.zo_threads,
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
ztest_opts.zo_raid_children,
|
|
|
|
ztest_opts.zo_raid_type,
|
2012-01-24 06:43:32 +04:00
|
|
|
(u_longlong_t)ztest_opts.zo_time);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2012-10-04 22:14:04 +04:00
|
|
|
cmd = umem_alloc(MAXNAMELEN, UMEM_NOFAIL);
|
|
|
|
(void) strlcpy(cmd, getexecname(), MAXNAMELEN);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
zs->zs_do_init = B_TRUE;
|
|
|
|
if (strlen(ztest_opts.zo_alt_ztest) != 0) {
|
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("Executing older ztest for "
|
|
|
|
"initialization: %s\n", ztest_opts.zo_alt_ztest);
|
|
|
|
}
|
|
|
|
VERIFY(!exec_child(ztest_opts.zo_alt_ztest,
|
|
|
|
ztest_opts.zo_alt_libpath, B_FALSE, NULL));
|
|
|
|
} else {
|
|
|
|
VERIFY(!exec_child(NULL, NULL, B_FALSE, NULL));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_do_init = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_proc_start = gethrtime();
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_proc_stop = zs->zs_proc_start + ztest_opts.zo_time * NANOSEC;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (f = 0; f < ZTEST_FUNCS; f++) {
|
2012-01-24 06:43:32 +04:00
|
|
|
zi = &ztest_info[f];
|
|
|
|
zc = ZTEST_GET_SHARED_CALLSTATE(f);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zs->zs_proc_start + zi->zi_interval[0] > zs->zs_proc_stop)
|
2012-01-24 06:43:32 +04:00
|
|
|
zc->zc_next = UINT64_MAX;
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
2012-01-24 06:43:32 +04:00
|
|
|
zc->zc_next = zs->zs_proc_start +
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_random(2 * zi->zi_interval[0] + 1);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Run the tests in a loop. These tests include fault injection
|
|
|
|
* to verify that self-healing data works, and forced crashes
|
|
|
|
* to verify that we never lose on-disk consistency.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
while (gethrtime() < zs->zs_proc_stop) {
|
2008-11-20 23:01:55 +03:00
|
|
|
int status;
|
2012-01-24 06:43:32 +04:00
|
|
|
boolean_t killed;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize the workload counters for each function.
|
|
|
|
*/
|
2010-08-26 20:52:39 +04:00
|
|
|
for (f = 0; f < ZTEST_FUNCS; f++) {
|
2012-01-24 06:43:32 +04:00
|
|
|
zc = ZTEST_GET_SHARED_CALLSTATE(f);
|
|
|
|
zc->zc_count = 0;
|
|
|
|
zc->zc_time = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/* Set the allocation switch size */
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_metaslab_df_alloc_threshold =
|
|
|
|
ztest_random(zs->zs_metaslab_sz / 4) + 1;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (!hasalt || ztest_random(2) == 0) {
|
|
|
|
if (hasalt && ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("Executing newer ztest: %s\n",
|
|
|
|
cmd);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2012-01-24 06:43:32 +04:00
|
|
|
newer++;
|
|
|
|
killed = exec_child(cmd, NULL, B_TRUE, &status);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (hasalt && ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("Executing older ztest: %s\n",
|
|
|
|
ztest_opts.zo_alt_ztest);
|
|
|
|
}
|
|
|
|
older++;
|
|
|
|
killed = exec_child(ztest_opts.zo_alt_ztest,
|
|
|
|
ztest_opts.zo_alt_libpath, B_TRUE, &status);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (killed)
|
|
|
|
kills++;
|
2008-11-20 23:01:55 +03:00
|
|
|
iters++;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
2008-11-20 23:01:55 +03:00
|
|
|
hrtime_t now = gethrtime();
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
now = MIN(now, zs->zs_proc_stop);
|
|
|
|
print_time(zs->zs_proc_stop - now, timebuf);
|
2017-06-13 12:16:45 +03:00
|
|
|
nicenum(zs->zs_space, numbuf, sizeof (numbuf));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("Pass %3d, %8s, %3llu ENOSPC, "
|
|
|
|
"%4.1f%% of %5s used, %3.0f%% done, %8s to go\n",
|
|
|
|
iters,
|
|
|
|
WIFEXITED(status) ? "Complete" : "SIGKILL",
|
|
|
|
(u_longlong_t)zs->zs_enospc_count,
|
|
|
|
100.0 * zs->zs_alloc / zs->zs_space,
|
|
|
|
numbuf,
|
2010-05-29 00:45:14 +04:00
|
|
|
100.0 * (now - zs->zs_proc_start) /
|
2012-01-24 06:43:32 +04:00
|
|
|
(ztest_opts.zo_time * NANOSEC), timebuf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 2) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\nWorkload summary:\n\n");
|
|
|
|
(void) printf("%7s %9s %s\n",
|
|
|
|
"Calls", "Time", "Function");
|
|
|
|
(void) printf("%7s %9s %s\n",
|
|
|
|
"-----", "----", "--------");
|
2010-08-26 20:52:39 +04:00
|
|
|
for (f = 0; f < ZTEST_FUNCS; f++) {
|
2012-01-24 06:43:32 +04:00
|
|
|
zi = &ztest_info[f];
|
|
|
|
zc = ZTEST_GET_SHARED_CALLSTATE(f);
|
|
|
|
print_time(zc->zc_time, timebuf);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("%7llu %9s %s\n",
|
2012-01-24 06:43:32 +04:00
|
|
|
(u_longlong_t)zc->zc_count, timebuf,
|
2015-02-24 21:53:31 +03:00
|
|
|
zi->zi_funcname);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (!ztest_opts.zo_mmp_test)
|
|
|
|
ztest_run_zdb(ztest_opts.zo_pool);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
if (hasalt) {
|
|
|
|
(void) printf("%d runs of older ztest: %s\n", older,
|
|
|
|
ztest_opts.zo_alt_ztest);
|
|
|
|
(void) printf("%d runs of newer ztest: %s\n", newer,
|
|
|
|
cmd);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("%d killed, %d completed, %.0f%% kill rate\n",
|
|
|
|
kills, iters - kills, (100.0 * kills) / MAX(1, iters));
|
|
|
|
}
|
|
|
|
|
2012-10-04 22:14:04 +04:00
|
|
|
umem_free(cmd, MAXNAMELEN);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|