2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* CDDL HEADER START
|
|
|
|
*
|
|
|
|
* The contents of this file are subject to the terms of the
|
|
|
|
* Common Development and Distribution License (the "License").
|
|
|
|
* You may not use this file except in compliance with the License.
|
|
|
|
*
|
|
|
|
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
|
2022-07-12 00:16:13 +03:00
|
|
|
* or https://opensource.org/licenses/CDDL-1.0.
|
2008-11-20 23:01:55 +03:00
|
|
|
* See the License for the specific language governing permissions
|
|
|
|
* and limitations under the License.
|
|
|
|
*
|
|
|
|
* When distributing Covered Code, include this CDDL HEADER in each
|
|
|
|
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
|
|
|
|
* If applicable, add the following below this CDDL HEADER, with the
|
|
|
|
* fields enclosed by brackets "[]" replaced with your own identifying
|
|
|
|
* information: Portions Copyright [yyyy] [name of copyright owner]
|
|
|
|
*
|
|
|
|
* CDDL HEADER END
|
|
|
|
*/
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
|
2024-03-29 22:15:56 +03:00
|
|
|
* Copyright (c) 2011, 2024 by Delphix. All rights reserved.
|
2011-11-12 02:07:54 +04:00
|
|
|
* Copyright 2011 Nexenta Systems, Inc. All rights reserved.
|
2013-05-25 06:06:23 +04:00
|
|
|
* Copyright (c) 2013 Steven Hartland. All rights reserved.
|
2017-06-13 12:16:45 +03:00
|
|
|
* Copyright (c) 2014 Integros [integros.com]
|
|
|
|
* Copyright 2017 Joyent, Inc.
|
2018-09-06 04:33:36 +03:00
|
|
|
* Copyright (c) 2017, Intel Corporation.
|
2024-07-26 19:16:18 +03:00
|
|
|
* Copyright (c) 2023, Klara, Inc.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The objective of this program is to provide a DMU/ZAP/SPA stress test
|
|
|
|
* that runs entirely in userland, is easy to use, and easy to extend.
|
|
|
|
*
|
|
|
|
* The overall design of the ztest program is as follows:
|
|
|
|
*
|
|
|
|
* (1) For each major functional area (e.g. adding vdevs to a pool,
|
|
|
|
* creating and destroying datasets, reading and writing objects, etc)
|
|
|
|
* we have a simple routine to test that functionality. These
|
|
|
|
* individual routines do not have to do anything "stressful".
|
|
|
|
*
|
|
|
|
* (2) We turn these simple functionality tests into a stress test by
|
|
|
|
* running them all in parallel, with as many threads as desired,
|
|
|
|
* and spread across as many datasets, objects, and vdevs as desired.
|
|
|
|
*
|
|
|
|
* (3) While all this is happening, we inject faults into the pool to
|
|
|
|
* verify that self-healing data really works.
|
|
|
|
*
|
|
|
|
* (4) Every time we open a dataset, we change its checksum and compression
|
|
|
|
* functions. Thus even individual objects vary from block to block
|
|
|
|
* in which checksum they use and whether they're compressed.
|
|
|
|
*
|
|
|
|
* (5) To verify that we never lose on-disk consistency after a crash,
|
|
|
|
* we run the entire test in a child of the main process.
|
|
|
|
* At random times, the child self-immolates with a SIGKILL.
|
|
|
|
* This is the software equivalent of pulling the power cord.
|
|
|
|
* The parent then runs the test again, using the existing
|
2014-06-06 01:19:08 +04:00
|
|
|
* storage pool, as many times as desired. If backwards compatibility
|
2012-01-24 06:43:32 +04:00
|
|
|
* testing is enabled ztest will sometimes run the "older" version
|
|
|
|
* of ztest after a SIGKILL.
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
|
|
|
* (6) To verify that we don't have future leaks or temporal incursions,
|
|
|
|
* many of the functional tests record the transaction group number
|
|
|
|
* as part of their data. When reading old data, they verify that
|
|
|
|
* the transaction group number is less than the current, open txg.
|
|
|
|
* If you add a new test, please do this if applicable.
|
|
|
|
*
|
2010-08-26 21:43:27 +04:00
|
|
|
* (7) Threads are created with a reduced stack size, for sanity checking.
|
|
|
|
* Therefore, it's important not to allocate huge buffers on the stack.
|
|
|
|
*
|
2008-11-20 23:01:55 +03:00
|
|
|
* When run with no arguments, ztest runs for about five minutes and
|
|
|
|
* produces no output if successful. To get a little bit of information,
|
|
|
|
* specify -V. To get more information, specify -VV, and so on.
|
|
|
|
*
|
|
|
|
* To turn this into an overnight stress test, use -T to specify run time.
|
|
|
|
*
|
2019-08-30 19:43:30 +03:00
|
|
|
* You can ask more vdevs [-v], datasets [-d], or threads [-t]
|
2008-11-20 23:01:55 +03:00
|
|
|
* to increase the pool capacity, fanout, and overall stress level.
|
|
|
|
*
|
2012-01-24 06:43:32 +04:00
|
|
|
* Use the -k option to set the desired frequency of kills.
|
|
|
|
*
|
|
|
|
* When ztest invokes itself it passes all relevant information through a
|
|
|
|
* temporary file which is mmap-ed in the child process. This allows shared
|
|
|
|
* memory to survive the exec syscall. The ztest_shared_hdr_t struct is always
|
|
|
|
* stored at offset 0 of this file and contains information on the size and
|
|
|
|
* number of shared structures in the file. The information stored in this file
|
|
|
|
* must remain backwards compatible with older versions of ztest so that
|
|
|
|
* ztest can invoke them during backwards compatibility testing (-B).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
|
|
|
#include <sys/zfs_context.h>
|
|
|
|
#include <sys/spa.h>
|
|
|
|
#include <sys/dmu.h>
|
|
|
|
#include <sys/txg.h>
|
2009-07-03 02:44:48 +04:00
|
|
|
#include <sys/dbuf.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/zap.h>
|
|
|
|
#include <sys/dmu_objset.h>
|
|
|
|
#include <sys/poll.h>
|
|
|
|
#include <sys/stat.h>
|
|
|
|
#include <sys/time.h>
|
|
|
|
#include <sys/wait.h>
|
|
|
|
#include <sys/mman.h>
|
|
|
|
#include <sys/resource.h>
|
|
|
|
#include <sys/zio.h>
|
|
|
|
#include <sys/zil.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/zil_impl.h>
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
#include <sys/vdev_draid.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/vdev_impl.h>
|
2008-12-03 23:09:06 +03:00
|
|
|
#include <sys/vdev_file.h>
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
#include <sys/vdev_initialize.h>
|
2019-07-12 19:31:20 +03:00
|
|
|
#include <sys/vdev_raidz.h>
|
2019-03-29 19:13:20 +03:00
|
|
|
#include <sys/vdev_trim.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/spa_impl.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/metaslab_impl.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/dsl_prop.h>
|
2009-07-03 02:44:48 +04:00
|
|
|
#include <sys/dsl_dataset.h>
|
2013-09-04 16:00:57 +04:00
|
|
|
#include <sys/dsl_destroy.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/dsl_scan.h>
|
|
|
|
#include <sys/zio_checksum.h>
|
2020-07-30 02:35:33 +03:00
|
|
|
#include <sys/zfs_refcount.h>
|
2012-12-14 03:24:15 +04:00
|
|
|
#include <sys/zfeature.h>
|
2013-09-04 16:00:57 +04:00
|
|
|
#include <sys/dsl_userhold.h>
|
2016-07-22 18:52:49 +03:00
|
|
|
#include <sys/abd.h>
|
Introduce BLAKE3 checksums as an OpenZFS feature
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes #10058
Closes #12918
2022-06-09 01:55:57 +03:00
|
|
|
#include <sys/blake3.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <stdio.h>
|
|
|
|
#include <stdlib.h>
|
|
|
|
#include <unistd.h>
|
2021-05-29 01:06:07 +03:00
|
|
|
#include <getopt.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <signal.h>
|
|
|
|
#include <umem.h>
|
|
|
|
#include <ctype.h>
|
|
|
|
#include <math.h>
|
|
|
|
#include <sys/fs/zfs.h>
|
2015-12-10 02:34:16 +03:00
|
|
|
#include <zfs_fletcher.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <libnvpair.h>
|
2018-11-05 22:22:33 +03:00
|
|
|
#include <libzutil.h>
|
2018-08-02 21:59:24 +03:00
|
|
|
#include <sys/crypto/icp.h>
|
2023-02-27 18:14:37 +03:00
|
|
|
#include <sys/zfs_impl.h>
|
2024-05-10 04:26:11 +03:00
|
|
|
#include <sys/backtrace.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-10-04 18:09:12 +04:00
|
|
|
static int ztest_fd_data = -1;
|
|
|
|
static int ztest_fd_rand = -1;
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
typedef struct ztest_shared_hdr {
|
|
|
|
uint64_t zh_hdr_size;
|
|
|
|
uint64_t zh_opts_size;
|
|
|
|
uint64_t zh_size;
|
|
|
|
uint64_t zh_stats_size;
|
|
|
|
uint64_t zh_stats_count;
|
|
|
|
uint64_t zh_ds_size;
|
|
|
|
uint64_t zh_ds_count;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
uint64_t zh_scratch_state_size;
|
2012-01-24 06:43:32 +04:00
|
|
|
} ztest_shared_hdr_t;
|
|
|
|
|
|
|
|
static ztest_shared_hdr_t *ztest_shared_hdr;
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
enum ztest_class_state {
|
|
|
|
ZTEST_VDEV_CLASS_OFF,
|
|
|
|
ZTEST_VDEV_CLASS_ON,
|
|
|
|
ZTEST_VDEV_CLASS_RND
|
|
|
|
};
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
/* Dedicated RAIDZ Expansion test states */
|
|
|
|
typedef enum {
|
|
|
|
RAIDZ_EXPAND_NONE, /* Default is none, must opt-in */
|
|
|
|
RAIDZ_EXPAND_REQUESTED, /* The '-X' option was used */
|
|
|
|
RAIDZ_EXPAND_STARTED, /* Testing has commenced */
|
|
|
|
RAIDZ_EXPAND_KILLED, /* Reached the proccess kill */
|
|
|
|
RAIDZ_EXPAND_CHECKED, /* Pool scrub verification done */
|
|
|
|
} raidz_expand_test_state_t;
|
|
|
|
|
|
|
|
|
2021-02-16 13:14:44 +03:00
|
|
|
#define ZO_GVARS_MAX_ARGLEN ((size_t)64)
|
|
|
|
#define ZO_GVARS_MAX_COUNT ((size_t)10)
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
typedef struct ztest_shared_opts {
|
2016-06-16 00:28:36 +03:00
|
|
|
char zo_pool[ZFS_MAX_DATASET_NAME_LEN];
|
|
|
|
char zo_dir[ZFS_MAX_DATASET_NAME_LEN];
|
2012-01-24 06:43:32 +04:00
|
|
|
char zo_alt_ztest[MAXNAMELEN];
|
|
|
|
char zo_alt_libpath[MAXNAMELEN];
|
|
|
|
uint64_t zo_vdevs;
|
|
|
|
uint64_t zo_vdevtime;
|
|
|
|
size_t zo_vdev_size;
|
|
|
|
int zo_ashift;
|
|
|
|
int zo_mirrors;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
int zo_raid_do_expand;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
int zo_raid_children;
|
|
|
|
int zo_raid_parity;
|
|
|
|
char zo_raid_type[8];
|
|
|
|
int zo_draid_data;
|
|
|
|
int zo_draid_spares;
|
2012-01-24 06:43:32 +04:00
|
|
|
int zo_datasets;
|
|
|
|
int zo_threads;
|
|
|
|
uint64_t zo_passtime;
|
|
|
|
uint64_t zo_killrate;
|
|
|
|
int zo_verbose;
|
|
|
|
int zo_init;
|
|
|
|
uint64_t zo_time;
|
|
|
|
uint64_t zo_maxloops;
|
2017-07-24 21:07:39 +03:00
|
|
|
uint64_t zo_metaslab_force_ganging;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_expand_test_state_t zo_raidz_expand_test;
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
int zo_mmp_test;
|
2018-09-06 04:33:36 +03:00
|
|
|
int zo_special_vdevs;
|
2018-10-15 22:14:22 +03:00
|
|
|
int zo_dump_dbgmsg;
|
2021-02-16 13:14:44 +03:00
|
|
|
int zo_gvars_count;
|
|
|
|
char zo_gvars[ZO_GVARS_MAX_COUNT][ZO_GVARS_MAX_ARGLEN];
|
2012-01-24 06:43:32 +04:00
|
|
|
} ztest_shared_opts_t;
|
|
|
|
|
2021-05-29 01:06:07 +03:00
|
|
|
/* Default values for command line options. */
|
|
|
|
#define DEFAULT_POOL "ztest"
|
|
|
|
#define DEFAULT_VDEV_DIR "/tmp"
|
|
|
|
#define DEFAULT_VDEV_COUNT 5
|
|
|
|
#define DEFAULT_VDEV_SIZE (SPA_MINDEVSIZE * 4) /* 256m default size */
|
|
|
|
#define DEFAULT_VDEV_SIZE_STR "256M"
|
|
|
|
#define DEFAULT_ASHIFT SPA_MINBLOCKSHIFT
|
|
|
|
#define DEFAULT_MIRRORS 2
|
|
|
|
#define DEFAULT_RAID_CHILDREN 4
|
|
|
|
#define DEFAULT_RAID_PARITY 1
|
|
|
|
#define DEFAULT_DRAID_DATA 4
|
|
|
|
#define DEFAULT_DRAID_SPARES 1
|
|
|
|
#define DEFAULT_DATASETS_COUNT 7
|
|
|
|
#define DEFAULT_THREADS 23
|
|
|
|
#define DEFAULT_RUN_TIME 300 /* 300 seconds */
|
|
|
|
#define DEFAULT_RUN_TIME_STR "300 sec"
|
|
|
|
#define DEFAULT_PASS_TIME 60 /* 60 seconds */
|
|
|
|
#define DEFAULT_PASS_TIME_STR "60 sec"
|
|
|
|
#define DEFAULT_KILL_RATE 70 /* 70% kill rate */
|
|
|
|
#define DEFAULT_KILLRATE_STR "70%"
|
|
|
|
#define DEFAULT_INITS 1
|
|
|
|
#define DEFAULT_MAX_LOOPS 50 /* 5 minutes */
|
|
|
|
#define DEFAULT_FORCE_GANGING (64 << 10)
|
|
|
|
#define DEFAULT_FORCE_GANGING_STR "64K"
|
|
|
|
|
|
|
|
/* Simplifying assumption: -1 is not a valid default. */
|
|
|
|
#define NO_DEFAULT -1
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
static const ztest_shared_opts_t ztest_opts_defaults = {
|
2021-05-29 01:06:07 +03:00
|
|
|
.zo_pool = DEFAULT_POOL,
|
|
|
|
.zo_dir = DEFAULT_VDEV_DIR,
|
2012-01-24 06:43:32 +04:00
|
|
|
.zo_alt_ztest = { '\0' },
|
|
|
|
.zo_alt_libpath = { '\0' },
|
2021-05-29 01:06:07 +03:00
|
|
|
.zo_vdevs = DEFAULT_VDEV_COUNT,
|
|
|
|
.zo_ashift = DEFAULT_ASHIFT,
|
|
|
|
.zo_mirrors = DEFAULT_MIRRORS,
|
|
|
|
.zo_raid_children = DEFAULT_RAID_CHILDREN,
|
|
|
|
.zo_raid_parity = DEFAULT_RAID_PARITY,
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
.zo_raid_type = VDEV_TYPE_RAIDZ,
|
2021-05-29 01:06:07 +03:00
|
|
|
.zo_vdev_size = DEFAULT_VDEV_SIZE,
|
|
|
|
.zo_draid_data = DEFAULT_DRAID_DATA, /* data drives */
|
|
|
|
.zo_draid_spares = DEFAULT_DRAID_SPARES, /* distributed spares */
|
|
|
|
.zo_datasets = DEFAULT_DATASETS_COUNT,
|
|
|
|
.zo_threads = DEFAULT_THREADS,
|
|
|
|
.zo_passtime = DEFAULT_PASS_TIME,
|
|
|
|
.zo_killrate = DEFAULT_KILL_RATE,
|
2012-01-24 06:43:32 +04:00
|
|
|
.zo_verbose = 0,
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
.zo_mmp_test = 0,
|
2021-05-29 01:06:07 +03:00
|
|
|
.zo_init = DEFAULT_INITS,
|
|
|
|
.zo_time = DEFAULT_RUN_TIME,
|
|
|
|
.zo_maxloops = DEFAULT_MAX_LOOPS, /* max loops during spa_freeze() */
|
|
|
|
.zo_metaslab_force_ganging = DEFAULT_FORCE_GANGING,
|
2018-09-06 04:33:36 +03:00
|
|
|
.zo_special_vdevs = ZTEST_VDEV_CLASS_RND,
|
2021-02-16 13:14:44 +03:00
|
|
|
.zo_gvars_count = 0,
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
.zo_raidz_expand_test = RAIDZ_EXPAND_NONE,
|
2012-01-24 06:43:32 +04:00
|
|
|
};
|
|
|
|
|
2017-07-24 21:07:39 +03:00
|
|
|
extern uint64_t metaslab_force_ganging;
|
2012-01-24 06:43:32 +04:00
|
|
|
extern uint64_t metaslab_df_alloc_threshold;
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
extern uint64_t zfs_deadman_synctime_ms;
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
extern uint_t metaslab_preload_limit;
|
2022-01-15 02:37:55 +03:00
|
|
|
extern int zfs_compressed_arc_enabled;
|
2018-01-19 12:19:47 +03:00
|
|
|
extern int zfs_abd_scatter_enabled;
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
extern uint_t dmu_object_alloc_chunk_shift;
|
2017-08-04 19:30:49 +03:00
|
|
|
extern boolean_t zfs_force_some_double_word_sm_entries;
|
2018-08-29 21:33:33 +03:00
|
|
|
extern unsigned long zio_decompress_fail_fraction;
|
2018-10-01 20:36:34 +03:00
|
|
|
extern unsigned long zfs_reconstruct_indirect_damage_fraction;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
extern uint64_t raidz_expand_max_reflow_bytes;
|
|
|
|
extern uint_t raidz_expand_pause_point;
|
2024-06-18 01:35:18 +03:00
|
|
|
extern boolean_t ddt_prune_artificial_age;
|
|
|
|
extern boolean_t ddt_dump_prune_histogram;
|
2019-01-02 22:46:04 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
static ztest_shared_opts_t *ztest_shared_opts;
|
|
|
|
static ztest_shared_opts_t ztest_opts;
|
2022-04-19 21:38:30 +03:00
|
|
|
static const char *const ztest_wkeydata = "abcdefghijklmnopqrstuvwxyz012345";
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
typedef struct ztest_shared_ds {
|
|
|
|
uint64_t zd_seq;
|
|
|
|
} ztest_shared_ds_t;
|
|
|
|
|
|
|
|
static ztest_shared_ds_t *ztest_shared_ds;
|
|
|
|
#define ZTEST_GET_SHARED_DS(d) (&ztest_shared_ds[d])
|
2010-05-29 00:45:14 +04:00
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
typedef struct ztest_scratch_state {
|
|
|
|
uint64_t zs_raidz_scratch_verify_pause;
|
|
|
|
} ztest_shared_scratch_state_t;
|
|
|
|
|
|
|
|
static ztest_shared_scratch_state_t *ztest_scratch_state;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
#define BT_MAGIC 0x123456789abcdefULL
|
2018-02-05 23:00:26 +03:00
|
|
|
#define MAXFAULTS(zs) \
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
(MAX((zs)->zs_mirrors, 1) * (ztest_opts.zo_raid_parity + 1) - 1)
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
enum ztest_io_type {
|
|
|
|
ZTEST_IO_WRITE_TAG,
|
|
|
|
ZTEST_IO_WRITE_PATTERN,
|
|
|
|
ZTEST_IO_WRITE_ZEROES,
|
|
|
|
ZTEST_IO_TRUNCATE,
|
|
|
|
ZTEST_IO_SETATTR,
|
2013-05-10 23:47:54 +04:00
|
|
|
ZTEST_IO_REWRITE,
|
2010-05-29 00:45:14 +04:00
|
|
|
ZTEST_IO_TYPES
|
|
|
|
};
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
typedef struct ztest_block_tag {
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t bt_magic;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t bt_objset;
|
|
|
|
uint64_t bt_object;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t bt_dnodesize;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t bt_offset;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t bt_gen;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t bt_txg;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t bt_crtxg;
|
2008-11-20 23:01:55 +03:00
|
|
|
} ztest_block_tag_t;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
typedef struct bufwad {
|
|
|
|
uint64_t bw_index;
|
|
|
|
uint64_t bw_txg;
|
|
|
|
uint64_t bw_data;
|
|
|
|
} bufwad_t;
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
/*
|
|
|
|
* It would be better to use a rangelock_t per object. Unfortunately
|
|
|
|
* the rangelock_t is not a drop-in replacement for rl_t, because we
|
|
|
|
* still need to map from object ID to rangelock_t.
|
|
|
|
*/
|
|
|
|
typedef enum {
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ZTRL_READER,
|
|
|
|
ZTRL_WRITER,
|
|
|
|
ZTRL_APPEND
|
2018-10-02 01:13:12 +03:00
|
|
|
} rl_type_t;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
typedef struct rll {
|
|
|
|
void *rll_writer;
|
|
|
|
int rll_readers;
|
2010-08-26 21:43:27 +04:00
|
|
|
kmutex_t rll_lock;
|
|
|
|
kcondvar_t rll_cv;
|
2010-05-29 00:45:14 +04:00
|
|
|
} rll_t;
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
typedef struct rl {
|
|
|
|
uint64_t rl_object;
|
|
|
|
uint64_t rl_offset;
|
|
|
|
uint64_t rl_size;
|
|
|
|
rll_t *rl_lock;
|
|
|
|
} rl_t;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
#define ZTEST_RANGE_LOCKS 64
|
|
|
|
#define ZTEST_OBJECT_LOCKS 64
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Object descriptor. Used as a template for object lookup/create/remove.
|
|
|
|
*/
|
|
|
|
typedef struct ztest_od {
|
|
|
|
uint64_t od_dir;
|
|
|
|
uint64_t od_object;
|
|
|
|
dmu_object_type_t od_type;
|
|
|
|
dmu_object_type_t od_crtype;
|
|
|
|
uint64_t od_blocksize;
|
|
|
|
uint64_t od_crblocksize;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t od_crdnodesize;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t od_gen;
|
|
|
|
uint64_t od_crgen;
|
2016-06-16 00:28:36 +03:00
|
|
|
char od_name[ZFS_MAX_DATASET_NAME_LEN];
|
2010-05-29 00:45:14 +04:00
|
|
|
} ztest_od_t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Per-dataset state.
|
|
|
|
*/
|
|
|
|
typedef struct ztest_ds {
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_ds_t *zd_shared;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *zd_os;
|
2018-06-05 02:52:10 +03:00
|
|
|
pthread_rwlock_t zd_zilog_lock;
|
2010-05-29 00:45:14 +04:00
|
|
|
zilog_t *zd_zilog;
|
|
|
|
ztest_od_t *zd_od; /* debugging aid */
|
2016-06-16 00:28:36 +03:00
|
|
|
char zd_name[ZFS_MAX_DATASET_NAME_LEN];
|
2010-08-26 21:43:27 +04:00
|
|
|
kmutex_t zd_dirobj_lock;
|
2010-05-29 00:45:14 +04:00
|
|
|
rll_t zd_object_lock[ZTEST_OBJECT_LOCKS];
|
2018-10-02 01:13:12 +03:00
|
|
|
rll_t zd_range_lock[ZTEST_RANGE_LOCKS];
|
2010-05-29 00:45:14 +04:00
|
|
|
} ztest_ds_t;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Per-iteration state.
|
|
|
|
*/
|
|
|
|
typedef void ztest_func_t(ztest_ds_t *zd, uint64_t id);
|
|
|
|
|
|
|
|
typedef struct ztest_info {
|
|
|
|
ztest_func_t *zi_func; /* test function */
|
|
|
|
uint64_t zi_iters; /* iterations per execution */
|
|
|
|
uint64_t *zi_interval; /* execute every <interval> seconds */
|
2015-02-24 21:53:31 +03:00
|
|
|
const char *zi_funcname; /* name of test function */
|
2010-05-29 00:45:14 +04:00
|
|
|
} ztest_info_t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
typedef struct ztest_shared_callstate {
|
|
|
|
uint64_t zc_count; /* per-pass count */
|
|
|
|
uint64_t zc_time; /* per-pass time */
|
|
|
|
uint64_t zc_next; /* next time to call this function */
|
|
|
|
} ztest_shared_callstate_t;
|
|
|
|
|
|
|
|
static ztest_shared_callstate_t *ztest_shared_callstate;
|
|
|
|
#define ZTEST_GET_SHARED_CALLSTATE(c) (&ztest_shared_callstate[c])
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_dmu_read_write;
|
|
|
|
ztest_func_t ztest_dmu_write_parallel;
|
|
|
|
ztest_func_t ztest_dmu_object_alloc_free;
|
2018-01-19 12:19:47 +03:00
|
|
|
ztest_func_t ztest_dmu_object_next_chunk;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_dmu_commit_callbacks;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_zap;
|
|
|
|
ztest_func_t ztest_zap_parallel;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_zil_commit;
|
2011-07-26 23:41:53 +04:00
|
|
|
ztest_func_t ztest_zil_remount;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_dmu_read_write_zcopy;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_dmu_objset_create_destroy;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_dmu_prealloc;
|
|
|
|
ztest_func_t ztest_fzap;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_dmu_snapshot_create_destroy;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_dsl_prop_get_set;
|
|
|
|
ztest_func_t ztest_spa_prop_get_set;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_spa_create_destroy;
|
|
|
|
ztest_func_t ztest_fault_inject;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_dmu_snapshot_hold;
|
2017-07-21 03:54:26 +03:00
|
|
|
ztest_func_t ztest_mmp_enable_disable;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_scrub;
|
|
|
|
ztest_func_t ztest_dsl_dataset_promote_busy;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_vdev_attach_detach;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_func_t ztest_vdev_raidz_attach;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_func_t ztest_vdev_LUN_growth;
|
|
|
|
ztest_func_t ztest_vdev_add_remove;
|
2018-09-06 04:33:36 +03:00
|
|
|
ztest_func_t ztest_vdev_class_add;
|
2008-12-03 23:09:06 +03:00
|
|
|
ztest_func_t ztest_vdev_aux_add_remove;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_func_t ztest_split_pool;
|
2011-11-12 02:07:54 +04:00
|
|
|
ztest_func_t ztest_reguid;
|
2012-12-15 04:28:49 +04:00
|
|
|
ztest_func_t ztest_spa_upgrade;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
ztest_func_t ztest_device_removal;
|
2016-12-17 01:11:29 +03:00
|
|
|
ztest_func_t ztest_spa_checkpoint_create_discard;
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
ztest_func_t ztest_initialize;
|
2019-03-29 19:13:20 +03:00
|
|
|
ztest_func_t ztest_trim;
|
Introduce BLAKE3 checksums as an OpenZFS feature
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes #10058
Closes #12918
2022-06-09 01:55:57 +03:00
|
|
|
ztest_func_t ztest_blake3;
|
2015-12-10 02:34:16 +03:00
|
|
|
ztest_func_t ztest_fletcher;
|
2016-09-23 04:52:29 +03:00
|
|
|
ztest_func_t ztest_fletcher_incr;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_func_t ztest_verify_dnode_bt;
|
2024-07-26 19:16:18 +03:00
|
|
|
ztest_func_t ztest_pool_prefetch_ddt;
|
2024-06-18 01:35:18 +03:00
|
|
|
ztest_func_t ztest_ddt_prune;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-10-18 21:05:32 +03:00
|
|
|
static uint64_t zopt_always = 0ULL * NANOSEC; /* all the time */
|
|
|
|
static uint64_t zopt_incessant = 1ULL * NANOSEC / 10; /* every 1/10 second */
|
|
|
|
static uint64_t zopt_often = 1ULL * NANOSEC; /* every second */
|
|
|
|
static uint64_t zopt_sometimes = 10ULL * NANOSEC; /* every 10 seconds */
|
|
|
|
static uint64_t zopt_rarely = 60ULL * NANOSEC; /* every 60 seconds */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-02-24 21:53:31 +03:00
|
|
|
#define ZTI_INIT(func, iters, interval) \
|
|
|
|
{ .zi_func = (func), \
|
|
|
|
.zi_iters = (iters), \
|
|
|
|
.zi_interval = (interval), \
|
|
|
|
.zi_funcname = # func }
|
|
|
|
|
2022-10-18 21:05:32 +03:00
|
|
|
static ztest_info_t ztest_info[] = {
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_dmu_read_write, 1, &zopt_always),
|
|
|
|
ZTI_INIT(ztest_dmu_write_parallel, 10, &zopt_always),
|
|
|
|
ZTI_INIT(ztest_dmu_object_alloc_free, 1, &zopt_always),
|
2018-01-19 12:19:47 +03:00
|
|
|
ZTI_INIT(ztest_dmu_object_next_chunk, 1, &zopt_sometimes),
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_dmu_commit_callbacks, 1, &zopt_always),
|
|
|
|
ZTI_INIT(ztest_zap, 30, &zopt_always),
|
|
|
|
ZTI_INIT(ztest_zap_parallel, 100, &zopt_always),
|
2023-01-05 03:04:02 +03:00
|
|
|
ZTI_INIT(ztest_split_pool, 1, &zopt_sometimes),
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_zil_commit, 1, &zopt_incessant),
|
|
|
|
ZTI_INIT(ztest_zil_remount, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_dmu_read_write_zcopy, 1, &zopt_often),
|
|
|
|
ZTI_INIT(ztest_dmu_objset_create_destroy, 1, &zopt_often),
|
|
|
|
ZTI_INIT(ztest_dsl_prop_get_set, 1, &zopt_often),
|
|
|
|
ZTI_INIT(ztest_spa_prop_get_set, 1, &zopt_sometimes),
|
2010-05-29 00:45:14 +04:00
|
|
|
#if 0
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_dmu_prealloc, 1, &zopt_sometimes),
|
2010-05-29 00:45:14 +04:00
|
|
|
#endif
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_fzap, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_dmu_snapshot_create_destroy, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_spa_create_destroy, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_fault_inject, 1, &zopt_sometimes),
|
|
|
|
ZTI_INIT(ztest_dmu_snapshot_hold, 1, &zopt_sometimes),
|
2017-07-21 03:54:26 +03:00
|
|
|
ZTI_INIT(ztest_mmp_enable_disable, 1, &zopt_sometimes),
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_reguid, 1, &zopt_rarely),
|
|
|
|
ZTI_INIT(ztest_scrub, 1, &zopt_rarely),
|
|
|
|
ZTI_INIT(ztest_spa_upgrade, 1, &zopt_rarely),
|
|
|
|
ZTI_INIT(ztest_dsl_dataset_promote_busy, 1, &zopt_rarely),
|
|
|
|
ZTI_INIT(ztest_vdev_attach_detach, 1, &zopt_sometimes),
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ZTI_INIT(ztest_vdev_raidz_attach, 1, &zopt_sometimes),
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_vdev_LUN_growth, 1, &zopt_rarely),
|
|
|
|
ZTI_INIT(ztest_vdev_add_remove, 1, &ztest_opts.zo_vdevtime),
|
2018-09-06 04:33:36 +03:00
|
|
|
ZTI_INIT(ztest_vdev_class_add, 1, &ztest_opts.zo_vdevtime),
|
2015-02-24 21:53:31 +03:00
|
|
|
ZTI_INIT(ztest_vdev_aux_add_remove, 1, &ztest_opts.zo_vdevtime),
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
ZTI_INIT(ztest_device_removal, 1, &zopt_sometimes),
|
2016-12-17 01:11:29 +03:00
|
|
|
ZTI_INIT(ztest_spa_checkpoint_create_discard, 1, &zopt_rarely),
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
ZTI_INIT(ztest_initialize, 1, &zopt_sometimes),
|
2019-03-29 19:13:20 +03:00
|
|
|
ZTI_INIT(ztest_trim, 1, &zopt_sometimes),
|
Introduce BLAKE3 checksums as an OpenZFS feature
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes #10058
Closes #12918
2022-06-09 01:55:57 +03:00
|
|
|
ZTI_INIT(ztest_blake3, 1, &zopt_rarely),
|
2015-12-10 02:34:16 +03:00
|
|
|
ZTI_INIT(ztest_fletcher, 1, &zopt_rarely),
|
2016-09-23 04:52:29 +03:00
|
|
|
ZTI_INIT(ztest_fletcher_incr, 1, &zopt_rarely),
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ZTI_INIT(ztest_verify_dnode_bt, 1, &zopt_sometimes),
|
2024-07-26 19:16:18 +03:00
|
|
|
ZTI_INIT(ztest_pool_prefetch_ddt, 1, &zopt_rarely),
|
2024-06-18 01:35:18 +03:00
|
|
|
ZTI_INIT(ztest_ddt_prune, 1, &zopt_rarely),
|
2008-11-20 23:01:55 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
#define ZTEST_FUNCS (sizeof (ztest_info) / sizeof (ztest_info_t))
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* The following struct is used to hold a list of uncalled commit callbacks.
|
|
|
|
* The callbacks are ordered by txg number.
|
|
|
|
*/
|
|
|
|
typedef struct ztest_cb_list {
|
2010-08-26 21:43:27 +04:00
|
|
|
kmutex_t zcl_callbacks_lock;
|
|
|
|
list_t zcl_callbacks;
|
2010-05-29 00:45:14 +04:00
|
|
|
} ztest_cb_list_t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Stuff we need to share writably between parent and child.
|
|
|
|
*/
|
|
|
|
typedef struct ztest_shared {
|
2012-01-24 06:43:32 +04:00
|
|
|
boolean_t zs_do_init;
|
2010-05-29 00:45:14 +04:00
|
|
|
hrtime_t zs_proc_start;
|
|
|
|
hrtime_t zs_proc_stop;
|
|
|
|
hrtime_t zs_thread_start;
|
|
|
|
hrtime_t zs_thread_stop;
|
|
|
|
hrtime_t zs_thread_kill;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t zs_enospc_count;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t zs_vdev_next_leaf;
|
|
|
|
uint64_t zs_vdev_aux;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t zs_alloc;
|
|
|
|
uint64_t zs_space;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t zs_splits;
|
|
|
|
uint64_t zs_mirrors;
|
2012-01-24 06:43:32 +04:00
|
|
|
uint64_t zs_metaslab_sz;
|
|
|
|
uint64_t zs_metaslab_df_alloc_threshold;
|
|
|
|
uint64_t zs_guid;
|
2008-11-20 23:01:55 +03:00
|
|
|
} ztest_shared_t;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
#define ID_PARALLEL -1ULL
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static char ztest_dev_template[] = "%s/%s.%llua";
|
2008-12-03 23:09:06 +03:00
|
|
|
static char ztest_aux_template[] = "%s/%s.%s.%llu";
|
2022-10-18 21:05:32 +03:00
|
|
|
static ztest_shared_t *ztest_shared;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
static spa_t *ztest_spa = NULL;
|
|
|
|
static ztest_ds_t *ztest_ds;
|
|
|
|
|
|
|
|
static kmutex_t ztest_vdev_lock;
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
static boolean_t ztest_device_removal_active = B_FALSE;
|
2019-01-18 20:47:55 +03:00
|
|
|
static boolean_t ztest_pool_scrubbed = B_FALSE;
|
2016-12-17 01:11:29 +03:00
|
|
|
static kmutex_t ztest_checkpoint_lock;
|
2012-12-15 00:38:04 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The ztest_name_lock protects the pool and dataset namespace used by
|
|
|
|
* the individual tests. To modify the namespace, consumers must grab
|
|
|
|
* this lock as writer. Grabbing the lock as reader will ensure that the
|
|
|
|
* namespace does not change while the lock is held.
|
|
|
|
*/
|
2018-06-05 02:52:10 +03:00
|
|
|
static pthread_rwlock_t ztest_name_lock;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
static boolean_t ztest_dump_core = B_TRUE;
|
2008-12-03 23:09:06 +03:00
|
|
|
static boolean_t ztest_exiting;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* Global commit callback list */
|
|
|
|
static ztest_cb_list_t zcl;
|
2010-08-26 21:17:18 +04:00
|
|
|
/* Commit cb delay */
|
|
|
|
static uint64_t zc_min_txg_delay = UINT64_MAX;
|
|
|
|
static int zc_cb_counter = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Minimum number of commit callbacks that need to be registered for us to check
|
|
|
|
* whether the minimum txg delay is acceptable.
|
|
|
|
*/
|
|
|
|
#define ZTEST_COMMIT_CB_MIN_REG 100
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If a number of txgs equal to this threshold have been created after a commit
|
|
|
|
* callback has been registered but not called, then we assume there is an
|
|
|
|
* implementation bug.
|
|
|
|
*/
|
|
|
|
#define ZTEST_COMMIT_CB_THRESH (TXG_CONCURRENT_STATES + 1000)
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
enum ztest_object {
|
|
|
|
ZTEST_META_DNODE = 0,
|
|
|
|
ZTEST_DIROBJ,
|
|
|
|
ZTEST_OBJECTS
|
|
|
|
};
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((noreturn)) void usage(boolean_t requested);
|
2019-01-18 20:47:55 +03:00
|
|
|
static int ztest_scrub_impl(spa_t *spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* These libumem hooks provide a reasonable set of defaults for the allocator's
|
|
|
|
* debugging facilities.
|
|
|
|
*/
|
|
|
|
const char *
|
2010-08-26 20:52:41 +04:00
|
|
|
_umem_debug_init(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
return ("default,verbose"); /* $UMEM_DEBUG setting */
|
|
|
|
}
|
|
|
|
|
|
|
|
const char *
|
|
|
|
_umem_logging_init(void)
|
|
|
|
{
|
|
|
|
return ("fail,contents"); /* $UMEM_LOGGING setting */
|
|
|
|
}
|
|
|
|
|
2017-12-19 01:06:07 +03:00
|
|
|
static void
|
|
|
|
dump_debug_buffer(void)
|
|
|
|
{
|
2018-10-15 22:14:22 +03:00
|
|
|
ssize_t ret __attribute__((unused));
|
|
|
|
|
|
|
|
if (!ztest_opts.zo_dump_dbgmsg)
|
2017-12-19 01:06:07 +03:00
|
|
|
return;
|
|
|
|
|
2018-10-15 22:14:22 +03:00
|
|
|
/*
|
|
|
|
* We use write() instead of printf() so that this function
|
|
|
|
* is safe to call from a signal handler.
|
|
|
|
*/
|
2024-05-10 06:58:26 +03:00
|
|
|
ret = write(STDERR_FILENO, "\n", 1);
|
|
|
|
zfs_dbgmsg_print(STDERR_FILENO, "ztest");
|
2017-12-19 01:06:07 +03:00
|
|
|
}
|
|
|
|
|
2014-10-11 05:05:54 +04:00
|
|
|
static void sig_handler(int signo)
|
|
|
|
{
|
|
|
|
struct sigaction action;
|
|
|
|
|
2024-05-10 06:04:14 +03:00
|
|
|
libspl_backtrace(STDERR_FILENO);
|
2017-12-19 01:06:07 +03:00
|
|
|
dump_debug_buffer();
|
2014-10-11 05:05:54 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Restore default action and re-raise signal so SIGSEGV and
|
|
|
|
* SIGABRT can trigger a core dump.
|
|
|
|
*/
|
|
|
|
action.sa_handler = SIG_DFL;
|
|
|
|
sigemptyset(&action.sa_mask);
|
|
|
|
action.sa_flags = 0;
|
|
|
|
(void) sigaction(signo, &action, NULL);
|
|
|
|
raise(signo);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#define FATAL_MSG_SZ 1024
|
|
|
|
|
2022-04-19 21:38:30 +03:00
|
|
|
static const char *fatal_msg;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((format(printf, 2, 3))) __attribute__((noreturn)) void
|
2022-04-19 21:38:30 +03:00
|
|
|
fatal(int do_perror, const char *message, ...)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
va_list args;
|
|
|
|
int save_errno = errno;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *buf;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) fflush(stdout);
|
2010-08-26 22:13:05 +04:00
|
|
|
buf = umem_alloc(FATAL_MSG_SZ, UMEM_NOFAIL);
|
2022-02-04 01:35:38 +03:00
|
|
|
if (buf == NULL)
|
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
va_start(args, message);
|
|
|
|
(void) sprintf(buf, "ztest: ");
|
|
|
|
/* LINTED */
|
|
|
|
(void) vsprintf(buf + strlen(buf), message, args);
|
|
|
|
va_end(args);
|
|
|
|
if (do_perror) {
|
|
|
|
(void) snprintf(buf + strlen(buf), FATAL_MSG_SZ - strlen(buf),
|
|
|
|
": %s", strerror(save_errno));
|
|
|
|
}
|
|
|
|
(void) fprintf(stderr, "%s\n", buf);
|
|
|
|
fatal_msg = buf; /* to ease debugging */
|
2017-12-19 01:06:07 +03:00
|
|
|
|
2022-02-04 01:35:38 +03:00
|
|
|
out:
|
2008-11-20 23:01:55 +03:00
|
|
|
if (ztest_dump_core)
|
|
|
|
abort();
|
2018-10-15 22:14:22 +03:00
|
|
|
else
|
|
|
|
dump_debug_buffer();
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
exit(3);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
str2shift(const char *buf)
|
|
|
|
{
|
|
|
|
const char *ends = "BKMGTPEZ";
|
2024-10-02 19:10:06 +03:00
|
|
|
int i, len;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (buf[0] == '\0')
|
|
|
|
return (0);
|
2024-10-02 19:10:06 +03:00
|
|
|
|
|
|
|
len = strlen(ends);
|
|
|
|
for (i = 0; i < len; i++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (toupper(buf[0]) == ends[i])
|
|
|
|
break;
|
|
|
|
}
|
2024-10-02 19:10:06 +03:00
|
|
|
if (i == len) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) fprintf(stderr, "ztest: invalid bytes suffix: %s\n",
|
|
|
|
buf);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
if (buf[1] == '\0' || (toupper(buf[1]) == 'B' && buf[2] == '\0')) {
|
|
|
|
return (10*i);
|
|
|
|
}
|
|
|
|
(void) fprintf(stderr, "ztest: invalid bytes suffix: %s\n", buf);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static uint64_t
|
|
|
|
nicenumtoull(const char *buf)
|
|
|
|
{
|
|
|
|
char *end;
|
|
|
|
uint64_t val;
|
|
|
|
|
|
|
|
val = strtoull(buf, &end, 0);
|
|
|
|
if (end == buf) {
|
|
|
|
(void) fprintf(stderr, "ztest: bad numeric value: %s\n", buf);
|
|
|
|
usage(B_FALSE);
|
|
|
|
} else if (end[0] == '.') {
|
|
|
|
double fval = strtod(buf, &end);
|
|
|
|
fval *= pow(2, str2shift(end));
|
2020-03-16 21:56:29 +03:00
|
|
|
/*
|
|
|
|
* UINT64_MAX is not exactly representable as a double.
|
|
|
|
* The closest representation is UINT64_MAX + 1, so we
|
|
|
|
* use a >= comparison instead of > for the bounds check.
|
|
|
|
*/
|
|
|
|
if (fval >= (double)UINT64_MAX) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) fprintf(stderr, "ztest: value too large: %s\n",
|
|
|
|
buf);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
val = (uint64_t)fval;
|
|
|
|
} else {
|
|
|
|
int shift = str2shift(end);
|
|
|
|
if (shift >= 64 || (val << shift) >> shift != val) {
|
|
|
|
(void) fprintf(stderr, "ztest: value too large: %s\n",
|
|
|
|
buf);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
val <<= shift;
|
|
|
|
}
|
|
|
|
return (val);
|
|
|
|
}
|
|
|
|
|
2021-05-29 01:06:07 +03:00
|
|
|
typedef struct ztest_option {
|
|
|
|
const char short_opt;
|
|
|
|
const char *long_opt;
|
|
|
|
const char *long_opt_param;
|
|
|
|
const char *comment;
|
|
|
|
unsigned int default_int;
|
2022-04-19 21:38:30 +03:00
|
|
|
const char *default_str;
|
2021-05-29 01:06:07 +03:00
|
|
|
} ztest_option_t;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The following option_table is used for generating the usage info as well as
|
|
|
|
* the long and short option information for calling getopt_long().
|
|
|
|
*/
|
|
|
|
static ztest_option_t option_table[] = {
|
|
|
|
{ 'v', "vdevs", "INTEGER", "Number of vdevs", DEFAULT_VDEV_COUNT,
|
|
|
|
NULL},
|
|
|
|
{ 's', "vdev-size", "INTEGER", "Size of each vdev",
|
|
|
|
NO_DEFAULT, DEFAULT_VDEV_SIZE_STR},
|
|
|
|
{ 'a', "alignment-shift", "INTEGER",
|
|
|
|
"Alignment shift; use 0 for random", DEFAULT_ASHIFT, NULL},
|
|
|
|
{ 'm', "mirror-copies", "INTEGER", "Number of mirror copies",
|
|
|
|
DEFAULT_MIRRORS, NULL},
|
|
|
|
{ 'r', "raid-disks", "INTEGER", "Number of raidz/draid disks",
|
|
|
|
DEFAULT_RAID_CHILDREN, NULL},
|
|
|
|
{ 'R', "raid-parity", "INTEGER", "Raid parity",
|
|
|
|
DEFAULT_RAID_PARITY, NULL},
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
{ 'K', "raid-kind", "raidz|eraidz|draid|random", "Raid kind",
|
2021-05-29 01:06:07 +03:00
|
|
|
NO_DEFAULT, "random"},
|
|
|
|
{ 'D', "draid-data", "INTEGER", "Number of draid data drives",
|
|
|
|
DEFAULT_DRAID_DATA, NULL},
|
|
|
|
{ 'S', "draid-spares", "INTEGER", "Number of draid spares",
|
|
|
|
DEFAULT_DRAID_SPARES, NULL},
|
|
|
|
{ 'd', "datasets", "INTEGER", "Number of datasets",
|
|
|
|
DEFAULT_DATASETS_COUNT, NULL},
|
|
|
|
{ 't', "threads", "INTEGER", "Number of ztest threads",
|
|
|
|
DEFAULT_THREADS, NULL},
|
|
|
|
{ 'g', "gang-block-threshold", "INTEGER",
|
|
|
|
"Metaslab gang block threshold",
|
|
|
|
NO_DEFAULT, DEFAULT_FORCE_GANGING_STR},
|
|
|
|
{ 'i', "init-count", "INTEGER", "Number of times to initialize pool",
|
|
|
|
DEFAULT_INITS, NULL},
|
|
|
|
{ 'k', "kill-percentage", "INTEGER", "Kill percentage",
|
|
|
|
NO_DEFAULT, DEFAULT_KILLRATE_STR},
|
|
|
|
{ 'p', "pool-name", "STRING", "Pool name",
|
|
|
|
NO_DEFAULT, DEFAULT_POOL},
|
|
|
|
{ 'f', "vdev-file-directory", "PATH", "File directory for vdev files",
|
|
|
|
NO_DEFAULT, DEFAULT_VDEV_DIR},
|
|
|
|
{ 'M', "multi-host", NULL,
|
|
|
|
"Multi-host; simulate pool imported on remote host",
|
|
|
|
NO_DEFAULT, NULL},
|
|
|
|
{ 'E', "use-existing-pool", NULL,
|
|
|
|
"Use existing pool instead of creating new one", NO_DEFAULT, NULL},
|
|
|
|
{ 'T', "run-time", "INTEGER", "Total run time",
|
|
|
|
NO_DEFAULT, DEFAULT_RUN_TIME_STR},
|
|
|
|
{ 'P', "pass-time", "INTEGER", "Time per pass",
|
|
|
|
NO_DEFAULT, DEFAULT_PASS_TIME_STR},
|
|
|
|
{ 'F', "freeze-loops", "INTEGER", "Max loops in spa_freeze()",
|
|
|
|
DEFAULT_MAX_LOOPS, NULL},
|
|
|
|
{ 'B', "alt-ztest", "PATH", "Alternate ztest path",
|
|
|
|
NO_DEFAULT, NULL},
|
|
|
|
{ 'C', "vdev-class-state", "on|off|random", "vdev class state",
|
|
|
|
NO_DEFAULT, "random"},
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
{ 'X', "raidz-expansion", NULL,
|
|
|
|
"Perform a dedicated raidz expansion test",
|
|
|
|
NO_DEFAULT, NULL},
|
2021-05-29 01:06:07 +03:00
|
|
|
{ 'o', "option", "\"OPTION=INTEGER\"",
|
|
|
|
"Set global variable to an unsigned 32-bit integer value",
|
|
|
|
NO_DEFAULT, NULL},
|
|
|
|
{ 'G', "dump-debug-msg", NULL,
|
|
|
|
"Dump zfs_dbgmsg buffer before exiting due to an error",
|
|
|
|
NO_DEFAULT, NULL},
|
|
|
|
{ 'V', "verbose", NULL,
|
|
|
|
"Verbose (use multiple times for ever more verbosity)",
|
|
|
|
NO_DEFAULT, NULL},
|
|
|
|
{ 'h', "help", NULL, "Show this help",
|
|
|
|
NO_DEFAULT, NULL},
|
|
|
|
{0, 0, 0, 0, 0, 0}
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct option *long_opts = NULL;
|
|
|
|
static char *short_opts = NULL;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2021-05-29 01:06:07 +03:00
|
|
|
init_options(void)
|
|
|
|
{
|
|
|
|
ASSERT3P(long_opts, ==, NULL);
|
|
|
|
ASSERT3P(short_opts, ==, NULL);
|
|
|
|
|
|
|
|
int count = sizeof (option_table) / sizeof (option_table[0]);
|
|
|
|
long_opts = umem_alloc(sizeof (struct option) * count, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
short_opts = umem_alloc(sizeof (char) * 2 * count, UMEM_NOFAIL);
|
|
|
|
int short_opt_index = 0;
|
|
|
|
|
|
|
|
for (int i = 0; i < count; i++) {
|
|
|
|
long_opts[i].val = option_table[i].short_opt;
|
|
|
|
long_opts[i].name = option_table[i].long_opt;
|
|
|
|
long_opts[i].has_arg = option_table[i].long_opt_param != NULL
|
|
|
|
? required_argument : no_argument;
|
|
|
|
long_opts[i].flag = NULL;
|
|
|
|
short_opts[short_opt_index++] = option_table[i].short_opt;
|
|
|
|
if (option_table[i].long_opt_param != NULL) {
|
|
|
|
short_opts[short_opt_index++] = ':';
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
fini_options(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-05-29 01:06:07 +03:00
|
|
|
int count = sizeof (option_table) / sizeof (option_table[0]);
|
|
|
|
|
|
|
|
umem_free(long_opts, sizeof (struct option) * count);
|
|
|
|
umem_free(short_opts, sizeof (char) * 2 * count);
|
|
|
|
|
|
|
|
long_opts = NULL;
|
|
|
|
short_opts = NULL;
|
|
|
|
}
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((noreturn)) void
|
2021-05-29 01:06:07 +03:00
|
|
|
usage(boolean_t requested)
|
|
|
|
{
|
|
|
|
char option[80];
|
2008-11-20 23:01:55 +03:00
|
|
|
FILE *fp = requested ? stdout : stderr;
|
|
|
|
|
2021-05-29 01:06:07 +03:00
|
|
|
(void) fprintf(fp, "Usage: %s [OPTIONS...]\n", DEFAULT_POOL);
|
|
|
|
for (int i = 0; option_table[i].short_opt != 0; i++) {
|
|
|
|
if (option_table[i].long_opt_param != NULL) {
|
|
|
|
(void) sprintf(option, " -%c --%s=%s",
|
|
|
|
option_table[i].short_opt,
|
|
|
|
option_table[i].long_opt,
|
|
|
|
option_table[i].long_opt_param);
|
|
|
|
} else {
|
|
|
|
(void) sprintf(option, " -%c --%s",
|
|
|
|
option_table[i].short_opt,
|
|
|
|
option_table[i].long_opt);
|
|
|
|
}
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
(void) fprintf(fp, " %-43s%s", option,
|
2021-05-29 01:06:07 +03:00
|
|
|
option_table[i].comment);
|
|
|
|
|
|
|
|
if (option_table[i].long_opt_param != NULL) {
|
|
|
|
if (option_table[i].default_str != NULL) {
|
|
|
|
(void) fprintf(fp, " (default: %s)",
|
|
|
|
option_table[i].default_str);
|
|
|
|
} else if (option_table[i].default_int != NO_DEFAULT) {
|
|
|
|
(void) fprintf(fp, " (default: %u)",
|
|
|
|
option_table[i].default_int);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
(void) fprintf(fp, "\n");
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
exit(requested ? 0 : 1);
|
|
|
|
}
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
static uint64_t
|
|
|
|
ztest_random(uint64_t range)
|
|
|
|
{
|
|
|
|
uint64_t r;
|
|
|
|
|
|
|
|
ASSERT3S(ztest_fd_rand, >=, 0);
|
|
|
|
|
|
|
|
if (range == 0)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (read(ztest_fd_rand, &r, sizeof (r)) != sizeof (r))
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_TRUE, "short read from /dev/urandom");
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
|
|
|
|
return (r % range);
|
|
|
|
}
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_parse_name_value(const char *input, ztest_shared_opts_t *zo)
|
|
|
|
{
|
|
|
|
char name[32];
|
|
|
|
char *value;
|
|
|
|
int state = ZTEST_VDEV_CLASS_RND;
|
|
|
|
|
|
|
|
(void) strlcpy(name, input, sizeof (name));
|
|
|
|
|
|
|
|
value = strchr(name, '=');
|
|
|
|
if (value == NULL) {
|
|
|
|
(void) fprintf(stderr, "missing value in property=value "
|
|
|
|
"'-C' argument (%s)\n", input);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
*(value) = '\0';
|
|
|
|
value++;
|
|
|
|
|
|
|
|
if (strcmp(value, "on") == 0) {
|
|
|
|
state = ZTEST_VDEV_CLASS_ON;
|
|
|
|
} else if (strcmp(value, "off") == 0) {
|
|
|
|
state = ZTEST_VDEV_CLASS_OFF;
|
|
|
|
} else if (strcmp(value, "random") == 0) {
|
|
|
|
state = ZTEST_VDEV_CLASS_RND;
|
|
|
|
} else {
|
|
|
|
(void) fprintf(stderr, "invalid property value '%s'\n", value);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (strcmp(name, "special") == 0) {
|
|
|
|
zo->zo_special_vdevs = state;
|
|
|
|
} else {
|
|
|
|
(void) fprintf(stderr, "invalid property name '%s'\n", name);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
if (zo->zo_verbose >= 3)
|
|
|
|
(void) printf("%s vdev state is '%s'\n", name, value);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
process_options(int argc, char **argv)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
char *path;
|
|
|
|
ztest_shared_opts_t *zo = &ztest_opts;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int opt;
|
|
|
|
uint64_t value;
|
2022-05-03 17:55:14 +03:00
|
|
|
const char *raid_kind = "random";
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(zo, &ztest_opts_defaults, sizeof (*zo));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-05-29 01:06:07 +03:00
|
|
|
init_options();
|
|
|
|
|
|
|
|
while ((opt = getopt_long(argc, argv, short_opts, long_opts,
|
|
|
|
NULL)) != EOF) {
|
2008-11-20 23:01:55 +03:00
|
|
|
value = 0;
|
|
|
|
switch (opt) {
|
|
|
|
case 'v':
|
|
|
|
case 's':
|
|
|
|
case 'a':
|
|
|
|
case 'm':
|
|
|
|
case 'r':
|
|
|
|
case 'R':
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
case 'D':
|
|
|
|
case 'S':
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'd':
|
|
|
|
case 't':
|
|
|
|
case 'g':
|
|
|
|
case 'i':
|
|
|
|
case 'k':
|
|
|
|
case 'T':
|
|
|
|
case 'P':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'F':
|
2008-11-20 23:01:55 +03:00
|
|
|
value = nicenumtoull(optarg);
|
|
|
|
}
|
|
|
|
switch (opt) {
|
|
|
|
case 'v':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_vdevs = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 's':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_vdev_size = MAX(SPA_MINDEVSIZE, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'a':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_ashift = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'm':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_mirrors = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'r':
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
zo->zo_raid_children = MAX(1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'R':
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
zo->zo_raid_parity = MIN(MAX(value, 1), 3);
|
|
|
|
break;
|
|
|
|
case 'K':
|
2022-05-03 17:55:14 +03:00
|
|
|
raid_kind = optarg;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
break;
|
|
|
|
case 'D':
|
|
|
|
zo->zo_draid_data = MAX(1, value);
|
|
|
|
break;
|
|
|
|
case 'S':
|
|
|
|
zo->zo_draid_spares = MAX(1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'd':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_datasets = MAX(1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 't':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_threads = MAX(1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'g':
|
2017-07-24 21:07:39 +03:00
|
|
|
zo->zo_metaslab_force_ganging =
|
|
|
|
MAX(SPA_MINBLOCKSIZE << 1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'i':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_init = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'k':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_killrate = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'p':
|
2012-01-24 06:43:32 +04:00
|
|
|
(void) strlcpy(zo->zo_pool, optarg,
|
|
|
|
sizeof (zo->zo_pool));
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'f':
|
2012-01-24 06:43:32 +04:00
|
|
|
path = realpath(optarg, NULL);
|
|
|
|
if (path == NULL) {
|
|
|
|
(void) fprintf(stderr, "error: %s: %s\n",
|
|
|
|
optarg, strerror(errno));
|
|
|
|
usage(B_FALSE);
|
|
|
|
} else {
|
|
|
|
(void) strlcpy(zo->zo_dir, path,
|
|
|
|
sizeof (zo->zo_dir));
|
2016-07-28 23:10:05 +03:00
|
|
|
free(path);
|
2012-01-24 06:43:32 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
case 'M':
|
|
|
|
zo->zo_mmp_test = 1;
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'V':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_verbose++;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
case 'X':
|
|
|
|
zo->zo_raidz_expand_test = RAIDZ_EXPAND_REQUESTED;
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'E':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_init = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'T':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_time = value;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 'P':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_passtime = MAX(1, value);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'F':
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_maxloops = MAX(1, value);
|
|
|
|
break;
|
|
|
|
case 'B':
|
2022-05-03 17:55:14 +03:00
|
|
|
(void) strlcpy(zo->zo_alt_ztest, optarg,
|
|
|
|
sizeof (zo->zo_alt_ztest));
|
2010-05-29 00:45:14 +04:00
|
|
|
break;
|
2018-09-06 04:33:36 +03:00
|
|
|
case 'C':
|
|
|
|
ztest_parse_name_value(optarg, zo);
|
|
|
|
break;
|
2017-01-31 21:13:10 +03:00
|
|
|
case 'o':
|
2021-02-16 13:14:44 +03:00
|
|
|
if (zo->zo_gvars_count >= ZO_GVARS_MAX_COUNT) {
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"max global var count (%zu) exceeded\n",
|
|
|
|
ZO_GVARS_MAX_COUNT);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
char *v = zo->zo_gvars[zo->zo_gvars_count];
|
|
|
|
if (strlcpy(v, optarg, ZO_GVARS_MAX_ARGLEN) >=
|
|
|
|
ZO_GVARS_MAX_ARGLEN) {
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"global var option '%s' is too long\n",
|
|
|
|
optarg);
|
2017-01-31 21:13:10 +03:00
|
|
|
usage(B_FALSE);
|
2021-02-16 13:14:44 +03:00
|
|
|
}
|
|
|
|
zo->zo_gvars_count++;
|
2017-01-31 21:13:10 +03:00
|
|
|
break;
|
2017-12-19 01:06:07 +03:00
|
|
|
case 'G':
|
2018-10-15 22:14:22 +03:00
|
|
|
zo->zo_dump_dbgmsg = 1;
|
2017-12-19 01:06:07 +03:00
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'h':
|
|
|
|
usage(B_TRUE);
|
|
|
|
break;
|
|
|
|
case '?':
|
|
|
|
default:
|
|
|
|
usage(B_FALSE);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-05-29 01:06:07 +03:00
|
|
|
fini_options();
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
/* Force compatible options for raidz expansion run */
|
|
|
|
if (zo->zo_raidz_expand_test == RAIDZ_EXPAND_REQUESTED) {
|
|
|
|
zo->zo_mmp_test = 0;
|
|
|
|
zo->zo_mirrors = 0;
|
|
|
|
zo->zo_vdevs = 1;
|
|
|
|
zo->zo_vdev_size = DEFAULT_VDEV_SIZE * 2;
|
|
|
|
zo->zo_raid_do_expand = B_FALSE;
|
|
|
|
raid_kind = "raidz";
|
|
|
|
}
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (strcmp(raid_kind, "random") == 0) {
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
switch (ztest_random(3)) {
|
|
|
|
case 0:
|
|
|
|
raid_kind = "raidz";
|
|
|
|
break;
|
|
|
|
case 1:
|
|
|
|
raid_kind = "eraidz";
|
|
|
|
break;
|
|
|
|
case 2:
|
|
|
|
raid_kind = "draid";
|
|
|
|
break;
|
|
|
|
}
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
|
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
|
|
|
(void) printf("choosing RAID type '%s'\n", raid_kind);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (strcmp(raid_kind, "draid") == 0) {
|
|
|
|
uint64_t min_devsize;
|
|
|
|
|
|
|
|
/* With fewer disk use 256M, otherwise 128M is OK */
|
|
|
|
min_devsize = (ztest_opts.zo_raid_children < 16) ?
|
|
|
|
(256ULL << 20) : (128ULL << 20);
|
|
|
|
|
|
|
|
/* No top-level mirrors with dRAID for now */
|
|
|
|
zo->zo_mirrors = 0;
|
|
|
|
|
|
|
|
/* Use more appropriate defaults for dRAID */
|
|
|
|
if (zo->zo_vdevs == ztest_opts_defaults.zo_vdevs)
|
|
|
|
zo->zo_vdevs = 1;
|
|
|
|
if (zo->zo_raid_children ==
|
|
|
|
ztest_opts_defaults.zo_raid_children)
|
|
|
|
zo->zo_raid_children = 16;
|
|
|
|
if (zo->zo_ashift < 12)
|
|
|
|
zo->zo_ashift = 12;
|
|
|
|
if (zo->zo_vdev_size < min_devsize)
|
|
|
|
zo->zo_vdev_size = min_devsize;
|
|
|
|
|
|
|
|
if (zo->zo_draid_data + zo->zo_raid_parity >
|
|
|
|
zo->zo_raid_children - zo->zo_draid_spares) {
|
|
|
|
(void) fprintf(stderr, "error: too few draid "
|
|
|
|
"children (%d) for stripe width (%d)\n",
|
|
|
|
zo->zo_raid_children,
|
|
|
|
zo->zo_draid_data + zo->zo_raid_parity);
|
|
|
|
usage(B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) strlcpy(zo->zo_raid_type, VDEV_TYPE_DRAID,
|
|
|
|
sizeof (zo->zo_raid_type));
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
} else if (strcmp(raid_kind, "eraidz") == 0) {
|
|
|
|
/* using eraidz (expandable raidz) */
|
|
|
|
zo->zo_raid_do_expand = B_TRUE;
|
|
|
|
|
|
|
|
/* tests expect top-level to be raidz */
|
|
|
|
zo->zo_mirrors = 0;
|
|
|
|
zo->zo_vdevs = 1;
|
|
|
|
|
|
|
|
/* Make sure parity is less than data columns */
|
|
|
|
zo->zo_raid_parity = MIN(zo->zo_raid_parity,
|
|
|
|
zo->zo_raid_children - 1);
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
} else /* using raidz */ {
|
|
|
|
ASSERT0(strcmp(raid_kind, "raidz"));
|
|
|
|
|
|
|
|
zo->zo_raid_parity = MIN(zo->zo_raid_parity,
|
|
|
|
zo->zo_raid_children - 1);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
zo->zo_vdevtime =
|
|
|
|
(zo->zo_vdevs > 0 ? zo->zo_time * NANOSEC / zo->zo_vdevs :
|
2010-05-29 00:45:14 +04:00
|
|
|
UINT64_MAX >> 2);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2022-05-03 17:55:14 +03:00
|
|
|
if (*zo->zo_alt_ztest) {
|
|
|
|
const char *invalid_what = "ztest";
|
|
|
|
char *val = zo->zo_alt_ztest;
|
|
|
|
if (0 != access(val, X_OK) ||
|
2022-12-04 02:08:55 +03:00
|
|
|
(strrchr(val, '/') == NULL && (errno == EINVAL)))
|
2022-05-03 17:55:14 +03:00
|
|
|
goto invalid;
|
|
|
|
|
|
|
|
int dirlen = strrchr(val, '/') - val;
|
Cleanup: Switch to strlcpy from strncpy
Coverity found a bug in `zfs_secpolicy_create_clone()` where it is
possible for us to pass an unterminated string when `zfs_get_parent()`
returns an error. Upon inspection, it is clear that using `strlcpy()`
would have avoided this issue.
Looking at the codebase, there are a number of other uses of `strncpy()`
that are unsafe and even when it is used safely, switching to
`strlcpy()` would make the code more readable. Therefore, we switch all
instances where we use `strncpy()` to use `strlcpy()`.
Unfortunately, we do not portably have access to `strlcpy()` in
tests/zfs-tests/cmd/zfs_diff-socket.c because it does not link to
libspl. Modifying the appropriate Makefile.am to try to link to it
resulted in an error from the naming choice used in the file. Trying to
disable the check on the file did not work on FreeBSD because Clang
ignores `#undef` when a definition is provided by `-Dstrncpy(...)=...`.
We workaround that by explictly including the C file from libspl into
the test. This makes things build correctly everywhere.
We add a deprecation warning to `config/Rules.am` and suppress it on the
remaining `strncpy()` usage. `strlcpy()` is not portably avaliable in
tests/zfs-tests/cmd/zfs_diff-socket.c, so we use `snprintf()` there as a
substitute.
This patch does not tackle the related problem of `strcpy()`, which is
even less safe. Thankfully, a quick inspection found that it is used far
more correctly than strncpy() was used. A quick inspection did not find
any problems with `strcpy()` usage outside of zhack, but it should be
said that I only checked around 90% of them.
Lastly, some of the fields in kstat_t varied in size by 1 depending on
whether they were in userspace or in the kernel. The origin of this
discrepancy appears to be 04a479f7066ccdaa23a6546955303b172f4a6909 where
it was made for no apparent reason. It conflicts with the comment on
KSTAT_STRLEN, so we shrink the kernel field sizes to match the userspace
field sizes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13876
2022-09-28 02:35:29 +03:00
|
|
|
strlcpy(zo->zo_alt_libpath, val,
|
|
|
|
MIN(sizeof (zo->zo_alt_libpath), dirlen + 1));
|
2022-05-03 17:55:14 +03:00
|
|
|
invalid_what = "library path", val = zo->zo_alt_libpath;
|
2022-12-04 02:08:55 +03:00
|
|
|
if (strrchr(val, '/') == NULL && (errno == EINVAL))
|
2022-05-03 17:55:14 +03:00
|
|
|
goto invalid;
|
|
|
|
*strrchr(val, '/') = '\0';
|
|
|
|
strlcat(val, "/lib", sizeof (zo->zo_alt_libpath));
|
|
|
|
|
|
|
|
if (0 != access(zo->zo_alt_libpath, X_OK))
|
|
|
|
goto invalid;
|
|
|
|
return;
|
2012-10-04 23:54:47 +04:00
|
|
|
|
2022-05-03 17:55:14 +03:00
|
|
|
invalid:
|
|
|
|
ztest_dump_core = B_FALSE;
|
|
|
|
fatal(B_TRUE, "invalid alternate %s %s", invalid_what, val);
|
2012-01-24 06:43:32 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_kill(ztest_shared_t *zs)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_alloc = metaslab_class_get_alloc(spa_normal_class(ztest_spa));
|
|
|
|
zs->zs_space = metaslab_class_get_space(spa_normal_class(ztest_spa));
|
2013-08-08 00:16:22 +04:00
|
|
|
|
|
|
|
/*
|
2022-05-03 17:55:14 +03:00
|
|
|
* Before we kill ourselves, make sure that the config is updated.
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
* See comment above spa_write_cachefile().
|
2013-08-08 00:16:22 +04:00
|
|
|
*/
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
if (raidz_expand_pause_point != RAIDZ_EXPAND_PAUSE_NONE) {
|
|
|
|
if (mutex_tryenter(&spa_namespace_lock)) {
|
|
|
|
spa_write_cachefile(ztest_spa, B_FALSE, B_FALSE,
|
|
|
|
B_FALSE);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
ztest_scratch_state->zs_raidz_scratch_verify_pause =
|
|
|
|
raidz_expand_pause_point;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Do not verify scratch object in case if
|
|
|
|
* spa_namespace_lock cannot be acquired,
|
|
|
|
* it can cause deadlock in spa_config_update().
|
|
|
|
*/
|
|
|
|
raidz_expand_pause_point = RAIDZ_EXPAND_PAUSE_NONE;
|
|
|
|
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_write_cachefile(ztest_spa, B_FALSE, B_FALSE, B_FALSE);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
2013-08-08 00:16:22 +04:00
|
|
|
|
2022-05-03 17:55:14 +03:00
|
|
|
(void) raise(SIGKILL);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_record_enospc(const char *s)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) s;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared->zs_enospc_count++;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static uint64_t
|
|
|
|
ztest_get_ashift(void)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_ashift == 0)
|
2014-09-23 03:42:03 +04:00
|
|
|
return (SPA_MINBLOCKSHIFT + ztest_random(5));
|
2012-01-24 06:43:32 +04:00
|
|
|
return (ztest_opts.zo_ashift);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
static boolean_t
|
|
|
|
ztest_is_draid_spare(const char *name)
|
|
|
|
{
|
|
|
|
uint64_t spare_id = 0, parity = 0, vdev_id = 0;
|
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
if (sscanf(name, VDEV_TYPE_DRAID "%"PRIu64"-%"PRIu64"-%"PRIu64"",
|
|
|
|
&parity, &vdev_id, &spare_id) == 3) {
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static nvlist_t *
|
2022-04-19 21:38:30 +03:00
|
|
|
make_vdev_file(const char *path, const char *aux, const char *pool,
|
|
|
|
size_t size, uint64_t ashift)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2022-04-19 21:38:30 +03:00
|
|
|
char *pathbuf = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t vdev;
|
|
|
|
nvlist_t *file;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
boolean_t draid_spare = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (ashift == 0)
|
|
|
|
ashift = ztest_get_ashift();
|
|
|
|
|
|
|
|
if (path == NULL) {
|
2022-04-19 21:38:30 +03:00
|
|
|
pathbuf = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
2008-12-03 23:09:06 +03:00
|
|
|
path = pathbuf;
|
|
|
|
|
|
|
|
if (aux != NULL) {
|
|
|
|
vdev = ztest_shared->zs_vdev_aux;
|
2022-04-19 21:38:30 +03:00
|
|
|
(void) snprintf(pathbuf, MAXPATHLEN,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_aux_template, ztest_opts.zo_dir,
|
2012-12-15 04:28:49 +04:00
|
|
|
pool == NULL ? ztest_opts.zo_pool : pool,
|
|
|
|
aux, vdev);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev = ztest_shared->zs_vdev_next_leaf++;
|
2022-04-19 21:38:30 +03:00
|
|
|
(void) snprintf(pathbuf, MAXPATHLEN,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dev_template, ztest_opts.zo_dir,
|
2012-12-15 04:28:49 +04:00
|
|
|
pool == NULL ? ztest_opts.zo_pool : pool, vdev);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
} else {
|
|
|
|
draid_spare = ztest_is_draid_spare(path);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (size != 0 && !draid_spare) {
|
2008-12-03 23:09:06 +03:00
|
|
|
int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (fd == -1)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_TRUE, "can't open %s", path);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (ftruncate(fd, size) != 0)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_TRUE, "can't ftruncate %s", path);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) close(fd);
|
|
|
|
}
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
file = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(file, ZPOOL_CONFIG_TYPE,
|
|
|
|
draid_spare ? VDEV_TYPE_DRAID_SPARE : VDEV_TYPE_FILE);
|
|
|
|
fnvlist_add_string(file, ZPOOL_CONFIG_PATH, path);
|
|
|
|
fnvlist_add_uint64(file, ZPOOL_CONFIG_ASHIFT, ashift);
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(pathbuf, MAXPATHLEN);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (file);
|
|
|
|
}
|
|
|
|
|
|
|
|
static nvlist_t *
|
2022-04-19 21:38:30 +03:00
|
|
|
make_vdev_raid(const char *path, const char *aux, const char *pool, size_t size,
|
2012-12-15 04:28:49 +04:00
|
|
|
uint64_t ashift, int r)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
nvlist_t *raid, **child;
|
2008-11-20 23:01:55 +03:00
|
|
|
int c;
|
|
|
|
|
|
|
|
if (r < 2)
|
2012-12-15 04:28:49 +04:00
|
|
|
return (make_vdev_file(path, aux, pool, size, ashift));
|
2008-11-20 23:01:55 +03:00
|
|
|
child = umem_alloc(r * sizeof (nvlist_t *), UMEM_NOFAIL);
|
|
|
|
|
|
|
|
for (c = 0; c < r; c++)
|
2012-12-15 04:28:49 +04:00
|
|
|
child[c] = make_vdev_file(path, aux, pool, size, ashift);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
raid = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(raid, ZPOOL_CONFIG_TYPE,
|
|
|
|
ztest_opts.zo_raid_type);
|
|
|
|
fnvlist_add_uint64(raid, ZPOOL_CONFIG_NPARITY,
|
|
|
|
ztest_opts.zo_raid_parity);
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(raid, ZPOOL_CONFIG_CHILDREN,
|
|
|
|
(const nvlist_t **)child, r);
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
|
|
|
|
if (strcmp(ztest_opts.zo_raid_type, VDEV_TYPE_DRAID) == 0) {
|
|
|
|
uint64_t ndata = ztest_opts.zo_draid_data;
|
|
|
|
uint64_t nparity = ztest_opts.zo_raid_parity;
|
|
|
|
uint64_t nspares = ztest_opts.zo_draid_spares;
|
|
|
|
uint64_t children = ztest_opts.zo_raid_children;
|
|
|
|
uint64_t ngroups = 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate the minimum number of groups required to fill a
|
|
|
|
* slice. This is the LCM of the stripe width (data + parity)
|
|
|
|
* and the number of data drives (children - spares).
|
|
|
|
*/
|
|
|
|
while (ngroups * (ndata + nparity) % (children - nspares) != 0)
|
|
|
|
ngroups++;
|
|
|
|
|
|
|
|
/* Store the basic dRAID configuration. */
|
|
|
|
fnvlist_add_uint64(raid, ZPOOL_CONFIG_DRAID_NDATA, ndata);
|
|
|
|
fnvlist_add_uint64(raid, ZPOOL_CONFIG_DRAID_NSPARES, nspares);
|
|
|
|
fnvlist_add_uint64(raid, ZPOOL_CONFIG_DRAID_NGROUPS, ngroups);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (c = 0; c < r; c++)
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(child[c]);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
umem_free(child, r * sizeof (nvlist_t *));
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
return (raid);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static nvlist_t *
|
2022-04-19 21:38:30 +03:00
|
|
|
make_vdev_mirror(const char *path, const char *aux, const char *pool,
|
|
|
|
size_t size, uint64_t ashift, int r, int m)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
nvlist_t *mirror, **child;
|
|
|
|
int c;
|
|
|
|
|
|
|
|
if (m < 1)
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
return (make_vdev_raid(path, aux, pool, size, ashift, r));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
child = umem_alloc(m * sizeof (nvlist_t *), UMEM_NOFAIL);
|
|
|
|
|
|
|
|
for (c = 0; c < m; c++)
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
child[c] = make_vdev_raid(path, aux, pool, size, ashift, r);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
mirror = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(mirror, ZPOOL_CONFIG_TYPE, VDEV_TYPE_MIRROR);
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(mirror, ZPOOL_CONFIG_CHILDREN,
|
|
|
|
(const nvlist_t **)child, m);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (c = 0; c < m; c++)
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(child[c]);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
umem_free(child, m * sizeof (nvlist_t *));
|
|
|
|
|
|
|
|
return (mirror);
|
|
|
|
}
|
|
|
|
|
|
|
|
static nvlist_t *
|
2022-04-19 21:38:30 +03:00
|
|
|
make_vdev_root(const char *path, const char *aux, const char *pool, size_t size,
|
|
|
|
uint64_t ashift, const char *class, int r, int m, int t)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
nvlist_t *root, **child;
|
|
|
|
int c;
|
2018-09-06 04:33:36 +03:00
|
|
|
boolean_t log;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(t, >, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
log = (class != NULL && strcmp(class, "log") == 0);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
child = umem_alloc(t * sizeof (nvlist_t *), UMEM_NOFAIL);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
for (c = 0; c < t; c++) {
|
2012-12-15 04:28:49 +04:00
|
|
|
child[c] = make_vdev_mirror(path, aux, pool, size, ashift,
|
|
|
|
r, m);
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_add_uint64(child[c], ZPOOL_CONFIG_IS_LOG, log);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
if (class != NULL && class[0] != '\0') {
|
|
|
|
ASSERT(m > 1 || log); /* expecting a mirror */
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_add_string(child[c],
|
|
|
|
ZPOOL_CONFIG_ALLOCATION_BIAS, class);
|
2018-09-06 04:33:36 +03:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
root = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(root, ZPOOL_CONFIG_TYPE, VDEV_TYPE_ROOT);
|
|
|
|
fnvlist_add_nvlist_array(root, aux ? aux : ZPOOL_CONFIG_CHILDREN,
|
2021-12-07 04:19:13 +03:00
|
|
|
(const nvlist_t **)child, t);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (c = 0; c < t; c++)
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(child[c]);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
umem_free(child, t * sizeof (nvlist_t *));
|
|
|
|
|
|
|
|
return (root);
|
|
|
|
}
|
|
|
|
|
2012-12-15 04:28:49 +04:00
|
|
|
/*
|
|
|
|
* Find a random spa version. Returns back a random spa version in the
|
|
|
|
* range [initial_version, SPA_VERSION_FEATURES].
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
ztest_random_spa_version(uint64_t initial_version)
|
|
|
|
{
|
|
|
|
uint64_t version = initial_version;
|
|
|
|
|
|
|
|
if (version <= SPA_VERSION_BEFORE_FEATURES) {
|
|
|
|
version = version +
|
|
|
|
ztest_random(SPA_VERSION_BEFORE_FEATURES - version + 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (version > SPA_VERSION_BEFORE_FEATURES)
|
|
|
|
version = SPA_VERSION_FEATURES;
|
|
|
|
|
|
|
|
ASSERT(SPA_VERSION_IS_SUPPORTED(version));
|
|
|
|
return (version);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
|
|
|
ztest_random_blocksize(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(ztest_spa->spa_max_ashift, !=, 0);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
2014-11-03 23:15:08 +03:00
|
|
|
/*
|
|
|
|
* Choose a block size >= the ashift.
|
|
|
|
* If the SPA supports new MAXBLOCKSIZE, test up to 1MB blocks.
|
|
|
|
*/
|
|
|
|
int maxbs = SPA_OLD_MAXBLOCKSHIFT;
|
|
|
|
if (spa_maxblocksize(ztest_spa) == SPA_MAXBLOCKSIZE)
|
|
|
|
maxbs = 20;
|
2015-05-20 07:14:01 +03:00
|
|
|
uint64_t block_shift =
|
|
|
|
ztest_random(maxbs - ztest_spa->spa_max_ashift + 1);
|
2014-09-23 03:42:03 +04:00
|
|
|
return (1 << (SPA_MINBLOCKSHIFT + block_shift));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
static int
|
|
|
|
ztest_random_dnodesize(void)
|
|
|
|
{
|
|
|
|
int slots;
|
|
|
|
int max_slots = spa_maxdnodesize(ztest_spa) >> DNODE_SHIFT;
|
|
|
|
|
|
|
|
if (max_slots == DNODE_MIN_SLOTS)
|
|
|
|
return (DNODE_MIN_SIZE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Weight the random distribution more heavily toward smaller
|
|
|
|
* dnode sizes since that is more likely to reflect real-world
|
|
|
|
* usage.
|
|
|
|
*/
|
|
|
|
ASSERT3U(max_slots, >, 4);
|
|
|
|
switch (ztest_random(10)) {
|
|
|
|
case 0:
|
|
|
|
slots = 5 + ztest_random(max_slots - 4);
|
|
|
|
break;
|
|
|
|
case 1 ... 4:
|
|
|
|
slots = 2 + ztest_random(3);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
slots = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (slots << DNODE_SHIFT);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
|
|
|
ztest_random_ibshift(void)
|
|
|
|
{
|
|
|
|
return (DN_MIN_INDBLKSHIFT +
|
|
|
|
ztest_random(DN_MAX_INDBLKSHIFT - DN_MIN_INDBLKSHIFT + 1));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static uint64_t
|
|
|
|
ztest_random_vdev_top(spa_t *spa, boolean_t log_ok)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t top;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
vdev_t *tvd;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
do {
|
|
|
|
top = ztest_random(rvd->vdev_children);
|
|
|
|
tvd = rvd->vdev_child[top];
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
} while (!vdev_is_concrete(tvd) || (tvd->vdev_islog && !log_ok) ||
|
2010-05-29 00:45:14 +04:00
|
|
|
tvd->vdev_mg == NULL || tvd->vdev_mg->mg_class == NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return (top);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static uint64_t
|
|
|
|
ztest_random_dsl_prop(zfs_prop_t prop)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t value;
|
|
|
|
|
|
|
|
do {
|
|
|
|
value = zfs_prop_random_value(prop, ztest_random(-1ULL));
|
|
|
|
} while (prop == ZFS_PROP_CHECKSUM && value == ZIO_CHECKSUM_OFF);
|
|
|
|
|
|
|
|
return (value);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dsl_prop_set_uint64(char *osname, zfs_prop_t prop, uint64_t value,
|
|
|
|
boolean_t inherit)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
const char *propname = zfs_prop_to_name(prop);
|
|
|
|
const char *valname;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *setpoint;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t curval;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_prop_set_int(osname, propname,
|
|
|
|
(inherit ? ZPROP_SRC_NONE : ZPROP_SRC_LOCAL), value);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (error);
|
|
|
|
}
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
setpoint = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY0(dsl_prop_get_integer(osname, propname, &curval, setpoint));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6) {
|
2015-07-30 07:11:32 +03:00
|
|
|
int err;
|
|
|
|
|
|
|
|
err = zfs_prop_index_to_string(prop, curval, &valname);
|
|
|
|
if (err)
|
2016-12-12 21:46:26 +03:00
|
|
|
(void) printf("%s %s = %llu at '%s'\n", osname,
|
|
|
|
propname, (unsigned long long)curval, setpoint);
|
2015-07-30 07:11:32 +03:00
|
|
|
else
|
|
|
|
(void) printf("%s %s = %s at '%s'\n",
|
|
|
|
osname, propname, valname, setpoint);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(setpoint, MAXPATHLEN);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_spa_prop_set_uint64(zpool_prop_t prop, uint64_t value)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *props = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
props = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64(props, zpool_prop_to_name(prop), value);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_prop_set(spa, props);
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(props);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (error);
|
|
|
|
}
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
static int
|
|
|
|
ztest_dmu_objset_own(const char *name, dmu_objset_type_t type,
|
2022-04-19 21:49:30 +03:00
|
|
|
boolean_t readonly, boolean_t decrypt, const void *tag, objset_t **osp)
|
2017-09-12 23:15:11 +03:00
|
|
|
{
|
|
|
|
int err;
|
2018-11-08 02:40:24 +03:00
|
|
|
char *cp = NULL;
|
|
|
|
char ddname[ZFS_MAX_DATASET_NAME_LEN];
|
|
|
|
|
Fix unsafe string operations
Coverity caught unsafe use of `strcpy()` in `ztest_dmu_objset_own()`,
`nfs_init_tmpfile()` and `dump_snapshot()`. It also caught an unsafe use
of `strlcat()` in `nfs_init_tmpfile()`.
Inspired by this, I did an audit of every single usage of `strcpy()` and
`strcat()` in the code. If I could not prove that the usage was safe, I
changed the code to use either `strlcpy()` or `strlcat()`, depending on
which function was originally used. In some cases, `snprintf()` was used
to replace multiple uses of `strcat` because it was cleaner.
Whenever I changed a function, I preferred to use `sizeof(dst)` when the
compiler is able to provide the string size via that. When it could not
because the string was passed by a caller, I checked the entire call
tree of the function to find out how big the buffer was and hard coded
it. Hardcoding is less than ideal, but it is safe unless someone shrinks
the buffer sizes being passed.
Additionally, Coverity reported three more string related issues:
* It caught a case where we do an overlapping memory copy in a call to
`snprintf()`. We fix that via `kmem_strdup()` and `kmem_strfree()`.
* It caught `sizeof (buf)` being used instead of `buflen` in
`zdb_nicenum()`'s call to `zfs_nicenum()`, which is passed to
`snprintf()`. We change that to pass `buflen`.
* It caught a theoretical unterminated string passed to `strcmp()`.
This one is likely a false positive, but we have the information
needed to do this more safely, so we change this to silence the false
positive not just in coverity, but potentially other static analysis
tools too. We switch to `strncmp()`.
* There was a false positive in tests/zfs-tests/cmd/dir_rd_update.c. We
suppress it by switching to `snprintf()` since other static analysis
tools might complain about it too. Interestingly, there is a possible
real bug there too, since it assumes that the passed directory path
ends with '/'. We add a '/' to fix that potential bug.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13913
2022-09-28 02:47:24 +03:00
|
|
|
strlcpy(ddname, name, sizeof (ddname));
|
2018-11-08 02:40:24 +03:00
|
|
|
cp = strchr(ddname, '@');
|
|
|
|
if (cp != NULL)
|
|
|
|
*cp = '\0';
|
2017-09-12 23:15:11 +03:00
|
|
|
|
|
|
|
err = dmu_objset_own(name, type, readonly, decrypt, tag, osp);
|
2018-11-08 02:40:24 +03:00
|
|
|
while (decrypt && err == EACCES) {
|
2017-09-12 23:15:11 +03:00
|
|
|
dsl_crypto_params_t *dcp;
|
|
|
|
nvlist_t *crypto_args = fnvlist_alloc();
|
|
|
|
|
|
|
|
fnvlist_add_uint8_array(crypto_args, "wkeydata",
|
|
|
|
(uint8_t *)ztest_wkeydata, WRAPPING_KEY_LEN);
|
|
|
|
VERIFY0(dsl_crypto_params_create_nvlist(DCP_CMD_NONE, NULL,
|
|
|
|
crypto_args, &dcp));
|
|
|
|
err = spa_keystore_load_wkey(ddname, dcp, B_FALSE);
|
2020-12-25 07:58:17 +03:00
|
|
|
/*
|
|
|
|
* Note: if there was an error loading, the wkey was not
|
|
|
|
* consumed, and needs to be freed.
|
|
|
|
*/
|
|
|
|
dsl_crypto_params_free(dcp, (err != 0));
|
2017-09-12 23:15:11 +03:00
|
|
|
fnvlist_free(crypto_args);
|
|
|
|
|
2018-11-08 02:40:24 +03:00
|
|
|
if (err == EINVAL) {
|
|
|
|
/*
|
|
|
|
* We couldn't load a key for this dataset so try
|
|
|
|
* the parent. This loop will eventually hit the
|
|
|
|
* encryption root since ztest only makes clones
|
|
|
|
* as children of their origin datasets.
|
|
|
|
*/
|
|
|
|
cp = strrchr(ddname, '/');
|
|
|
|
if (cp == NULL)
|
|
|
|
return (err);
|
|
|
|
|
|
|
|
*cp = '\0';
|
|
|
|
err = EACCES;
|
|
|
|
continue;
|
|
|
|
} else if (err != 0) {
|
|
|
|
break;
|
|
|
|
}
|
2017-09-12 23:15:11 +03:00
|
|
|
|
|
|
|
err = dmu_objset_own(name, type, readonly, decrypt, tag, osp);
|
2018-11-08 02:40:24 +03:00
|
|
|
break;
|
2017-09-12 23:15:11 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_rll_init(rll_t *rll)
|
|
|
|
{
|
|
|
|
rll->rll_writer = NULL;
|
|
|
|
rll->rll_readers = 0;
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_init(&rll->rll_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&rll->rll_cv, NULL, CV_DEFAULT, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_rll_destroy(rll_t *rll)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(rll->rll_writer, ==, NULL);
|
|
|
|
ASSERT0(rll->rll_readers);
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_destroy(&rll->rll_lock);
|
|
|
|
cv_destroy(&rll->rll_cv);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_rll_lock(rll_t *rll, rl_type_t type)
|
|
|
|
{
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_enter(&rll->rll_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
if (type == ZTRL_READER) {
|
2010-05-29 00:45:14 +04:00
|
|
|
while (rll->rll_writer != NULL)
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) cv_wait(&rll->rll_cv, &rll->rll_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
rll->rll_readers++;
|
|
|
|
} else {
|
|
|
|
while (rll->rll_writer != NULL || rll->rll_readers)
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) cv_wait(&rll->rll_cv, &rll->rll_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
rll->rll_writer = curthread;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_exit(&rll->rll_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_rll_unlock(rll_t *rll)
|
|
|
|
{
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_enter(&rll->rll_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (rll->rll_writer) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(rll->rll_readers);
|
2010-05-29 00:45:14 +04:00
|
|
|
rll->rll_writer = NULL;
|
|
|
|
} else {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(rll->rll_readers, >, 0);
|
|
|
|
ASSERT3P(rll->rll_writer, ==, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
rll->rll_readers--;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (rll->rll_writer == NULL && rll->rll_readers == 0)
|
2010-08-26 21:43:27 +04:00
|
|
|
cv_broadcast(&rll->rll_cv);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_exit(&rll->rll_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_object_lock(ztest_ds_t *zd, uint64_t object, rl_type_t type)
|
2008-12-03 23:09:06 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
rll_t *rll = &zd->zd_object_lock[object & (ZTEST_OBJECT_LOCKS - 1)];
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_rll_lock(rll, type);
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_object_unlock(ztest_ds_t *zd, uint64_t object)
|
|
|
|
{
|
|
|
|
rll_t *rll = &zd->zd_object_lock[object & (ZTEST_OBJECT_LOCKS - 1)];
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_rll_unlock(rll);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
static rl_t *
|
2016-01-05 23:47:58 +03:00
|
|
|
ztest_range_lock(ztest_ds_t *zd, uint64_t object, uint64_t offset,
|
|
|
|
uint64_t size, rl_type_t type)
|
|
|
|
{
|
2018-10-02 01:13:12 +03:00
|
|
|
uint64_t hash = object ^ (offset % (ZTEST_RANGE_LOCKS + 1));
|
|
|
|
rll_t *rll = &zd->zd_range_lock[hash & (ZTEST_RANGE_LOCKS - 1)];
|
|
|
|
rl_t *rl;
|
|
|
|
|
|
|
|
rl = umem_alloc(sizeof (*rl), UMEM_NOFAIL);
|
|
|
|
rl->rl_object = object;
|
|
|
|
rl->rl_offset = offset;
|
|
|
|
rl->rl_size = size;
|
|
|
|
rl->rl_lock = rll;
|
|
|
|
|
|
|
|
ztest_rll_lock(rll, type);
|
|
|
|
|
|
|
|
return (rl);
|
2016-01-05 23:47:58 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-01-05 23:47:58 +03:00
|
|
|
static void
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl_t *rl)
|
2016-01-05 23:47:58 +03:00
|
|
|
{
|
2018-10-02 01:13:12 +03:00
|
|
|
rll_t *rll = rl->rl_lock;
|
|
|
|
|
|
|
|
ztest_rll_unlock(rll);
|
|
|
|
|
|
|
|
umem_free(rl, sizeof (*rl));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_zd_init(ztest_ds_t *zd, ztest_shared_ds_t *szd, objset_t *os)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
zd->zd_os = os;
|
|
|
|
zd->zd_zilog = dmu_objset_zil(os);
|
2012-01-24 06:43:32 +04:00
|
|
|
zd->zd_shared = szd;
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_name(os, zd->zd_name);
|
2010-08-26 20:52:39 +04:00
|
|
|
int l;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (zd->zd_shared != NULL)
|
|
|
|
zd->zd_shared->zd_seq = 0;
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
VERIFY0(pthread_rwlock_init(&zd->zd_zilog_lock, NULL));
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_init(&zd->zd_dirobj_lock, NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (l = 0; l < ZTEST_OBJECT_LOCKS; l++)
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_rll_init(&zd->zd_object_lock[l]);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (l = 0; l < ZTEST_RANGE_LOCKS; l++)
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_rll_init(&zd->zd_range_lock[l]);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
ztest_zd_fini(ztest_ds_t *zd)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-08-26 20:52:39 +04:00
|
|
|
int l;
|
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_destroy(&zd->zd_dirobj_lock);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_destroy(&zd->zd_zilog_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (l = 0; l < ZTEST_OBJECT_LOCKS; l++)
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_rll_destroy(&zd->zd_object_lock[l]);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (l = 0; l < ZTEST_RANGE_LOCKS; l++)
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_rll_destroy(&zd->zd_range_lock[l]);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
#define TXG_MIGHTWAIT (ztest_random(10) == 0 ? TXG_NOWAIT : TXG_WAIT)
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static uint64_t
|
|
|
|
ztest_tx_assign(dmu_tx_t *tx, uint64_t txg_how, const char *tag)
|
|
|
|
{
|
|
|
|
uint64_t txg;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Attempt to assign tx to some transaction group.
|
|
|
|
*/
|
|
|
|
error = dmu_tx_assign(tx, txg_how);
|
|
|
|
if (error) {
|
|
|
|
if (error == ERESTART) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(txg_how, ==, TXG_NOWAIT);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_wait(tx);
|
|
|
|
} else {
|
|
|
|
ASSERT3U(error, ==, ENOSPC);
|
|
|
|
ztest_record_enospc(tag);
|
|
|
|
}
|
|
|
|
dmu_tx_abort(tx);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
txg = dmu_tx_get_txg(tx);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(txg, !=, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (txg);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_bt_generate(ztest_block_tag_t *bt, objset_t *os, uint64_t object,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t dnodesize, uint64_t offset, uint64_t gen, uint64_t txg,
|
|
|
|
uint64_t crtxg)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
bt->bt_magic = BT_MAGIC;
|
|
|
|
bt->bt_objset = dmu_objset_id(os);
|
|
|
|
bt->bt_object = object;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bt->bt_dnodesize = dnodesize;
|
2010-05-29 00:45:14 +04:00
|
|
|
bt->bt_offset = offset;
|
|
|
|
bt->bt_gen = gen;
|
|
|
|
bt->bt_txg = txg;
|
|
|
|
bt->bt_crtxg = crtxg;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_bt_verify(ztest_block_tag_t *bt, objset_t *os, uint64_t object,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t dnodesize, uint64_t offset, uint64_t gen, uint64_t txg,
|
|
|
|
uint64_t crtxg)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT3U(bt->bt_magic, ==, BT_MAGIC);
|
|
|
|
ASSERT3U(bt->bt_objset, ==, dmu_objset_id(os));
|
|
|
|
ASSERT3U(bt->bt_object, ==, object);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ASSERT3U(bt->bt_dnodesize, ==, dnodesize);
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT3U(bt->bt_offset, ==, offset);
|
|
|
|
ASSERT3U(bt->bt_gen, <=, gen);
|
|
|
|
ASSERT3U(bt->bt_txg, <=, txg);
|
|
|
|
ASSERT3U(bt->bt_crtxg, ==, crtxg);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static ztest_block_tag_t *
|
|
|
|
ztest_bt_bonus(dmu_buf_t *db)
|
|
|
|
{
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
ztest_block_tag_t *bt;
|
|
|
|
|
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
ASSERT3U(doi.doi_bonus_size, <=, db->db_size);
|
|
|
|
ASSERT3U(doi.doi_bonus_size, >=, sizeof (*bt));
|
|
|
|
bt = (void *)((char *)db->db_data + doi.doi_bonus_size - sizeof (*bt));
|
|
|
|
|
|
|
|
return (bt);
|
|
|
|
}
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
/*
|
|
|
|
* Generate a token to fill up unused bonus buffer space. Try to make
|
|
|
|
* it unique to the object, generation, and offset to verify that data
|
|
|
|
* is not getting overwritten by data from other dnodes.
|
|
|
|
*/
|
|
|
|
#define ZTEST_BONUS_FILL_TOKEN(obj, ds, gen, offset) \
|
|
|
|
(((ds) << 48) | ((gen) << 32) | ((obj) << 8) | (offset))
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Fill up the unused bonus buffer region before the block tag with a
|
|
|
|
* verifiable pattern. Filling the whole bonus area with non-zero data
|
|
|
|
* helps ensure that all dnode traversal code properly skips the
|
|
|
|
* interior regions of large dnodes.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_fill_unused_bonus(dmu_buf_t *db, void *end, uint64_t obj,
|
|
|
|
objset_t *os, uint64_t gen)
|
|
|
|
{
|
|
|
|
uint64_t *bonusp;
|
|
|
|
|
|
|
|
ASSERT(IS_P2ALIGNED((char *)end - (char *)db->db_data, 8));
|
|
|
|
|
|
|
|
for (bonusp = db->db_data; bonusp < (uint64_t *)end; bonusp++) {
|
|
|
|
uint64_t token = ZTEST_BONUS_FILL_TOKEN(obj, dmu_objset_id(os),
|
|
|
|
gen, bonusp - (uint64_t *)db->db_data);
|
|
|
|
*bonusp = token;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that the unused area of a bonus buffer is filled with the
|
|
|
|
* expected tokens.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_verify_unused_bonus(dmu_buf_t *db, void *end, uint64_t obj,
|
|
|
|
objset_t *os, uint64_t gen)
|
|
|
|
{
|
|
|
|
uint64_t *bonusp;
|
|
|
|
|
|
|
|
for (bonusp = db->db_data; bonusp < (uint64_t *)end; bonusp++) {
|
|
|
|
uint64_t token = ZTEST_BONUS_FILL_TOKEN(obj, dmu_objset_id(os),
|
|
|
|
gen, bonusp - (uint64_t *)db->db_data);
|
|
|
|
VERIFY3U(*bonusp, ==, token);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* ZIL logging ops
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define lrz_type lr_mode
|
|
|
|
#define lrz_blocksize lr_uid
|
|
|
|
#define lrz_ibshift lr_gid
|
|
|
|
#define lrz_bonustype lr_rdev
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
#define lrz_dnodesize lr_crtime[1]
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_log_create(ztest_ds_t *zd, dmu_tx_t *tx, lr_create_t *lr)
|
|
|
|
{
|
2024-09-27 19:18:11 +03:00
|
|
|
char *name = (char *)&lr->lr_data[0]; /* name follows lr */
|
2010-05-29 00:45:14 +04:00
|
|
|
size_t namesize = strlen(name) + 1;
|
|
|
|
itx_t *itx;
|
|
|
|
|
|
|
|
if (zil_replaying(zd->zd_zilog, tx))
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
itx = zil_itx_create(TX_CREATE, sizeof (*lr) + namesize);
|
2024-09-27 19:18:11 +03:00
|
|
|
memcpy(&itx->itx_lr + 1, &lr->lr_create.lr_common + 1,
|
2010-05-29 00:45:14 +04:00
|
|
|
sizeof (*lr) + namesize - sizeof (lr_t));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
zil_itx_assign(zd->zd_zilog, itx, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
static void
|
|
|
|
ztest_log_remove(ztest_ds_t *zd, dmu_tx_t *tx, lr_remove_t *lr, uint64_t object)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2024-09-27 19:18:11 +03:00
|
|
|
char *name = (char *)&lr->lr_data[0]; /* name follows lr */
|
2010-05-29 00:45:14 +04:00
|
|
|
size_t namesize = strlen(name) + 1;
|
|
|
|
itx_t *itx;
|
|
|
|
|
|
|
|
if (zil_replaying(zd->zd_zilog, tx))
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
itx = zil_itx_create(TX_REMOVE, sizeof (*lr) + namesize);
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(&itx->itx_lr + 1, &lr->lr_common + 1,
|
2010-05-29 00:45:14 +04:00
|
|
|
sizeof (*lr) + namesize - sizeof (lr_t));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
itx->itx_oid = object;
|
|
|
|
zil_itx_assign(zd->zd_zilog, itx, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_log_write(ztest_ds_t *zd, dmu_tx_t *tx, lr_write_t *lr)
|
|
|
|
{
|
|
|
|
itx_t *itx;
|
|
|
|
itx_wr_state_t write_state = ztest_random(WR_NUM_STATES);
|
|
|
|
|
|
|
|
if (zil_replaying(zd->zd_zilog, tx))
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2023-03-10 22:59:53 +03:00
|
|
|
if (lr->lr_length > zil_max_log_data(zd->zd_zilog, sizeof (lr_write_t)))
|
2010-05-29 00:45:14 +04:00
|
|
|
write_state = WR_INDIRECT;
|
|
|
|
|
|
|
|
itx = zil_itx_create(TX_WRITE,
|
|
|
|
sizeof (*lr) + (write_state == WR_COPIED ? lr->lr_length : 0));
|
|
|
|
|
|
|
|
if (write_state == WR_COPIED &&
|
|
|
|
dmu_read(zd->zd_os, lr->lr_foid, lr->lr_offset, lr->lr_length,
|
|
|
|
((lr_write_t *)&itx->itx_lr) + 1, DMU_READ_NO_PREFETCH) != 0) {
|
|
|
|
zil_itx_destroy(itx);
|
|
|
|
itx = zil_itx_create(TX_WRITE, sizeof (*lr));
|
|
|
|
write_state = WR_NEED_COPY;
|
|
|
|
}
|
|
|
|
itx->itx_private = zd;
|
|
|
|
itx->itx_wr_state = write_state;
|
|
|
|
itx->itx_sync = (ztest_random(8) == 0);
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(&itx->itx_lr + 1, &lr->lr_common + 1,
|
2010-05-29 00:45:14 +04:00
|
|
|
sizeof (*lr) - sizeof (lr_t));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
zil_itx_assign(zd->zd_zilog, itx, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_log_truncate(ztest_ds_t *zd, dmu_tx_t *tx, lr_truncate_t *lr)
|
|
|
|
{
|
|
|
|
itx_t *itx;
|
|
|
|
|
|
|
|
if (zil_replaying(zd->zd_zilog, tx))
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
itx = zil_itx_create(TX_TRUNCATE, sizeof (*lr));
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(&itx->itx_lr + 1, &lr->lr_common + 1,
|
2010-05-29 00:45:14 +04:00
|
|
|
sizeof (*lr) - sizeof (lr_t));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
itx->itx_sync = B_FALSE;
|
|
|
|
zil_itx_assign(zd->zd_zilog, itx, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_log_setattr(ztest_ds_t *zd, dmu_tx_t *tx, lr_setattr_t *lr)
|
|
|
|
{
|
|
|
|
itx_t *itx;
|
|
|
|
|
|
|
|
if (zil_replaying(zd->zd_zilog, tx))
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
itx = zil_itx_create(TX_SETATTR, sizeof (*lr));
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(&itx->itx_lr + 1, &lr->lr_common + 1,
|
2010-05-29 00:45:14 +04:00
|
|
|
sizeof (*lr) - sizeof (lr_t));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
itx->itx_sync = B_FALSE;
|
|
|
|
zil_itx_assign(zd->zd_zilog, itx, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ZIL replay ops
|
|
|
|
*/
|
|
|
|
static int
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_replay_create(void *arg1, void *arg2, boolean_t byteswap)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_ds_t *zd = arg1;
|
2024-09-27 19:18:11 +03:00
|
|
|
lr_create_t *lrc = arg2;
|
|
|
|
_lr_create_t *lr = &lrc->lr_create;
|
|
|
|
char *name = (char *)&lrc->lr_data[0]; /* name follows lr */
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
ztest_block_tag_t *bbt;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
uint64_t txg;
|
|
|
|
int error = 0;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
int bonuslen;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (byteswap)
|
|
|
|
byteswap_uint64_array(lr, sizeof (*lr));
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(lr->lr_doid, ==, ZTEST_DIROBJ);
|
|
|
|
ASSERT3S(name[0], !=, '\0');
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
dmu_tx_hold_zap(tx, lr->lr_doid, B_TRUE, name);
|
|
|
|
|
|
|
|
if (lr->lrz_type == DMU_OT_ZAP_OTHER) {
|
|
|
|
dmu_tx_hold_zap(tx, DMU_NEW_OBJECT, B_TRUE, NULL);
|
|
|
|
} else {
|
|
|
|
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
|
|
|
|
}
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
if (txg == 0)
|
|
|
|
return (ENOSPC);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(dmu_objset_zil(os)->zl_replay, ==, !!lr->lr_foid);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bonuslen = DN_BONUS_SIZE(lr->lrz_dnodesize);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (lr->lrz_type == DMU_OT_ZAP_OTHER) {
|
|
|
|
if (lr->lr_foid == 0) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
lr->lr_foid = zap_create_dnsize(os,
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lrz_type, lr->lrz_bonustype,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bonuslen, lr->lrz_dnodesize, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
error = zap_create_claim_dnsize(os, lr->lr_foid,
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lrz_type, lr->lrz_bonustype,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bonuslen, lr->lrz_dnodesize, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
if (lr->lr_foid == 0) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
lr->lr_foid = dmu_object_alloc_dnsize(os,
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lrz_type, 0, lr->lrz_bonustype,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bonuslen, lr->lrz_dnodesize, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
error = dmu_object_claim_dnsize(os, lr->lr_foid,
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lrz_type, 0, lr->lrz_bonustype,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
bonuslen, lr->lrz_dnodesize, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (error) {
|
|
|
|
ASSERT3U(error, ==, EEXIST);
|
|
|
|
ASSERT(zd->zd_zilog->zl_replay);
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(lr->lr_foid, !=, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (lr->lrz_type != DMU_OT_ZAP_OTHER)
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_object_set_blocksize(os, lr->lr_foid,
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lrz_blocksize, lr->lrz_ibshift, tx));
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_bonus_hold(os, lr->lr_foid, FTAG, &db));
|
2010-05-29 00:45:14 +04:00
|
|
|
bbt = ztest_bt_bonus(db);
|
|
|
|
dmu_buf_will_dirty(db, tx);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_generate(bbt, os, lr->lr_foid, lr->lrz_dnodesize, -1ULL,
|
|
|
|
lr->lr_gen, txg, txg);
|
|
|
|
ztest_fill_unused_bonus(db, bbt, lr->lr_foid, os, lr->lr_gen);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_add(os, lr->lr_doid, name, sizeof (uint64_t), 1,
|
2010-05-29 00:45:14 +04:00
|
|
|
&lr->lr_foid, tx));
|
|
|
|
|
2024-09-27 19:18:11 +03:00
|
|
|
(void) ztest_log_create(zd, tx, lrc);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_replay_remove(void *arg1, void *arg2, boolean_t byteswap)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_ds_t *zd = arg1;
|
|
|
|
lr_remove_t *lr = arg2;
|
2024-09-27 19:18:11 +03:00
|
|
|
char *name = (char *)&lr->lr_data[0]; /* name follows lr */
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
uint64_t object, txg;
|
|
|
|
|
|
|
|
if (byteswap)
|
|
|
|
byteswap_uint64_array(lr, sizeof (*lr));
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(lr->lr_doid, ==, ZTEST_DIROBJ);
|
|
|
|
ASSERT3S(name[0], !=, '\0');
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(
|
2010-05-29 00:45:14 +04:00
|
|
|
zap_lookup(os, lr->lr_doid, name, sizeof (object), 1, &object));
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(object, !=, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_object_lock(zd, object, ZTRL_WRITER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_object_info(os, object, &doi));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
dmu_tx_hold_zap(tx, lr->lr_doid, B_FALSE, name);
|
|
|
|
dmu_tx_hold_free(tx, object, 0, DMU_OBJECT_END);
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
return (ENOSPC);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (doi.doi_type == DMU_OT_ZAP_OTHER) {
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_destroy(os, object, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_object_free(os, object, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_remove(os, lr->lr_doid, name, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
(void) ztest_log_remove(zd, tx, lr, object);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_replay_write(void *arg1, void *arg2, boolean_t byteswap)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_ds_t *zd = arg1;
|
|
|
|
lr_write_t *lr = arg2;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
2024-09-27 19:18:11 +03:00
|
|
|
uint8_t *data = &lr->lr_data[0]; /* data follows lr */
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t offset, length;
|
2024-09-27 19:18:11 +03:00
|
|
|
ztest_block_tag_t *bt = (ztest_block_tag_t *)data;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_block_tag_t *bbt;
|
|
|
|
uint64_t gen, txg, lrtxg, crtxg;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
arc_buf_t *abuf = NULL;
|
2018-10-02 01:13:12 +03:00
|
|
|
rl_t *rl;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (byteswap)
|
|
|
|
byteswap_uint64_array(lr, sizeof (*lr));
|
|
|
|
|
|
|
|
offset = lr->lr_offset;
|
|
|
|
length = lr->lr_length;
|
|
|
|
|
|
|
|
/* If it's a dmu_sync() block, write the whole block */
|
|
|
|
if (lr->lr_common.lrc_reclen == sizeof (lr_write_t)) {
|
|
|
|
uint64_t blocksize = BP_GET_LSIZE(&lr->lr_blkptr);
|
|
|
|
if (length < blocksize) {
|
|
|
|
offset -= offset % blocksize;
|
|
|
|
length = blocksize;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bt->bt_magic == BSWAP_64(BT_MAGIC))
|
|
|
|
byteswap_uint64_array(bt, sizeof (*bt));
|
|
|
|
|
|
|
|
if (bt->bt_magic != BT_MAGIC)
|
|
|
|
bt = NULL;
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_object_lock(zd, lr->lr_foid, ZTRL_READER);
|
|
|
|
rl = ztest_range_lock(zd, lr->lr_foid, offset, length, ZTRL_WRITER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_bonus_hold(os, lr->lr_foid, FTAG, &db));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
|
|
|
|
bbt = ztest_bt_bonus(db);
|
|
|
|
ASSERT3U(bbt->bt_magic, ==, BT_MAGIC);
|
|
|
|
gen = bbt->bt_gen;
|
|
|
|
crtxg = bbt->bt_crtxg;
|
|
|
|
lrtxg = lr->lr_common.lrc_txg;
|
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
dmu_tx_hold_write(tx, lr->lr_foid, offset, length);
|
|
|
|
|
|
|
|
if (ztest_random(8) == 0 && length == doi.doi_data_block_size &&
|
|
|
|
P2PHASE(offset, length) == 0)
|
|
|
|
abuf = dmu_request_arcbuf(db, length);
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
|
|
|
if (abuf != NULL)
|
|
|
|
dmu_return_arcbuf(abuf);
|
|
|
|
dmu_buf_rele(db, FTAG);
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
return (ENOSPC);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bt != NULL) {
|
|
|
|
/*
|
|
|
|
* Usually, verify the old data before writing new data --
|
|
|
|
* but not always, because we also want to verify correct
|
|
|
|
* behavior when the data was not recently read into cache.
|
|
|
|
*/
|
2022-11-03 19:58:14 +03:00
|
|
|
ASSERT(doi.doi_data_block_size);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(offset % doi.doi_data_block_size);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (ztest_random(4) != 0) {
|
|
|
|
int prefetch = ztest_random(2) ?
|
|
|
|
DMU_READ_PREFETCH : DMU_READ_NO_PREFETCH;
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We will randomly set when to do O_DIRECT on a read.
|
|
|
|
*/
|
|
|
|
if (ztest_random(4) == 0)
|
|
|
|
prefetch |= DMU_DIRECTIO;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_block_tag_t rbt;
|
|
|
|
|
|
|
|
VERIFY(dmu_read(os, lr->lr_foid, offset,
|
|
|
|
sizeof (rbt), &rbt, prefetch) == 0);
|
|
|
|
if (rbt.bt_magic == BT_MAGIC) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_verify(&rbt, os, lr->lr_foid, 0,
|
2010-05-29 00:45:14 +04:00
|
|
|
offset, gen, txg, crtxg);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Writes can appear to be newer than the bonus buffer because
|
|
|
|
* the ztest_get_data() callback does a dmu_read() of the
|
|
|
|
* open-context data, which may be different than the data
|
|
|
|
* as it was when the write was generated.
|
|
|
|
*/
|
|
|
|
if (zd->zd_zilog->zl_replay) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_verify(bt, os, lr->lr_foid, 0, offset,
|
2010-05-29 00:45:14 +04:00
|
|
|
MAX(gen, bt->bt_gen), MAX(txg, lrtxg),
|
|
|
|
bt->bt_crtxg);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set the bt's gen/txg to the bonus buffer's gen/txg
|
|
|
|
* so that all of the usual ASSERTs will work.
|
|
|
|
*/
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_generate(bt, os, lr->lr_foid, 0, offset, gen, txg,
|
|
|
|
crtxg);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (abuf == NULL) {
|
|
|
|
dmu_write(os, lr->lr_foid, offset, length, data, tx);
|
|
|
|
} else {
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(abuf->b_data, data, length);
|
2022-09-24 02:52:03 +03:00
|
|
|
VERIFY0(dmu_assign_arcbuf_by_dbuf(db, offset, abuf, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
(void) ztest_log_write(zd, tx, lr);
|
|
|
|
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_replay_truncate(void *arg1, void *arg2, boolean_t byteswap)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_ds_t *zd = arg1;
|
|
|
|
lr_truncate_t *lr = arg2;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
uint64_t txg;
|
2018-10-02 01:13:12 +03:00
|
|
|
rl_t *rl;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (byteswap)
|
|
|
|
byteswap_uint64_array(lr, sizeof (*lr));
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_object_lock(zd, lr->lr_foid, ZTRL_READER);
|
2010-05-29 00:45:14 +04:00
|
|
|
rl = ztest_range_lock(zd, lr->lr_foid, lr->lr_offset, lr->lr_length,
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ZTRL_WRITER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
dmu_tx_hold_free(tx, lr->lr_foid, lr->lr_offset, lr->lr_length);
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
return (ENOSPC);
|
|
|
|
}
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_free_range(os, lr->lr_foid, lr->lr_offset,
|
|
|
|
lr->lr_length, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
(void) ztest_log_truncate(zd, tx, lr);
|
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_replay_setattr(void *arg1, void *arg2, boolean_t byteswap)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ztest_ds_t *zd = arg1;
|
|
|
|
lr_setattr_t *lr = arg2;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
ztest_block_tag_t *bbt;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t txg, lrtxg, crtxg, dnodesize;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (byteswap)
|
|
|
|
byteswap_uint64_array(lr, sizeof (*lr));
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_object_lock(zd, lr->lr_foid, ZTRL_WRITER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_bonus_hold(os, lr->lr_foid, FTAG, &db));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
dmu_tx_hold_bonus(tx, lr->lr_foid);
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
return (ENOSPC);
|
|
|
|
}
|
|
|
|
|
|
|
|
bbt = ztest_bt_bonus(db);
|
|
|
|
ASSERT3U(bbt->bt_magic, ==, BT_MAGIC);
|
|
|
|
crtxg = bbt->bt_crtxg;
|
|
|
|
lrtxg = lr->lr_common.lrc_txg;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
dnodesize = bbt->bt_dnodesize;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (zd->zd_zilog->zl_replay) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(lr->lr_size, !=, 0);
|
|
|
|
ASSERT3U(lr->lr_mode, !=, 0);
|
|
|
|
ASSERT3U(lrtxg, !=, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Randomly change the size and increment the generation.
|
|
|
|
*/
|
|
|
|
lr->lr_size = (ztest_random(db->db_size / sizeof (*bbt)) + 1) *
|
|
|
|
sizeof (*bbt);
|
|
|
|
lr->lr_mode = bbt->bt_gen + 1;
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(lrtxg);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that the current bonus buffer is not newer than our txg.
|
|
|
|
*/
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_verify(bbt, os, lr->lr_foid, dnodesize, -1ULL, lr->lr_mode,
|
2010-05-29 00:45:14 +04:00
|
|
|
MAX(txg, lrtxg), crtxg);
|
|
|
|
|
|
|
|
dmu_buf_will_dirty(db, tx);
|
|
|
|
|
|
|
|
ASSERT3U(lr->lr_size, >=, sizeof (*bbt));
|
|
|
|
ASSERT3U(lr->lr_size, <=, db->db_size);
|
2013-05-11 01:17:03 +04:00
|
|
|
VERIFY0(dmu_set_bonus(db, lr->lr_size, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
bbt = ztest_bt_bonus(db);
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_generate(bbt, os, lr->lr_foid, dnodesize, -1ULL, lr->lr_mode,
|
|
|
|
txg, crtxg);
|
|
|
|
ztest_fill_unused_bonus(db, bbt, lr->lr_foid, os, bbt->bt_gen);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
|
|
|
(void) ztest_log_setattr(zd, tx, lr);
|
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
ztest_object_unlock(zd, lr->lr_foid);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2022-10-18 21:05:32 +03:00
|
|
|
static zil_replay_func_t *ztest_replay_vector[TX_MAX_TYPE] = {
|
2017-10-27 22:46:35 +03:00
|
|
|
NULL, /* 0 no such transaction type */
|
|
|
|
ztest_replay_create, /* TX_CREATE */
|
|
|
|
NULL, /* TX_MKDIR */
|
|
|
|
NULL, /* TX_MKXATTR */
|
|
|
|
NULL, /* TX_SYMLINK */
|
|
|
|
ztest_replay_remove, /* TX_REMOVE */
|
|
|
|
NULL, /* TX_RMDIR */
|
|
|
|
NULL, /* TX_LINK */
|
|
|
|
NULL, /* TX_RENAME */
|
|
|
|
ztest_replay_write, /* TX_WRITE */
|
|
|
|
ztest_replay_truncate, /* TX_TRUNCATE */
|
|
|
|
ztest_replay_setattr, /* TX_SETATTR */
|
|
|
|
NULL, /* TX_ACL */
|
|
|
|
NULL, /* TX_CREATE_ACL */
|
|
|
|
NULL, /* TX_CREATE_ATTR */
|
|
|
|
NULL, /* TX_CREATE_ACL_ATTR */
|
|
|
|
NULL, /* TX_MKDIR_ACL */
|
|
|
|
NULL, /* TX_MKDIR_ATTR */
|
|
|
|
NULL, /* TX_MKDIR_ACL_ATTR */
|
|
|
|
NULL, /* TX_WRITE2 */
|
log xattr=sa create/remove/update to ZIL
As such, there are no specific synchronous semantics defined for
the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is
done, if sync=always is set on dataset. This provides sync semantics
for xattr=on with sync=always set on dataset.
For the xattr=sa implementation, it doesn't log to ZIL, so, even with
sync=always, xattrs are not guaranteed to be synced before xattr call
returns to caller. So, xattr can be lost if system crash happens, before
txg carrying xattr transaction is synced.
This change adds xattr=sa logging to ZIL on xattr create/remove/update
and xattrs are synced to ZIL (zil_commit() done) for sync=always.
This makes xattr=sa behavior similar to xattr=on.
Implementation notes:
The actual logging is fairly straight-forward and does not warrant
additional explanation.
However, it has been 14 years since we last added new TX types
to the ZIL [1], hence this is the first time we do it after the
introduction of zpool features. Therefore, here is an overview of the
feature activation and deactivation workflow:
1. The feature must be enabled. Otherwise, we don't log the new
record type. This ensures compatibility with older software.
2. The feature is activated per-dataset, since the ZIL is per-dataset.
3. If the feature is enabled and dataset is not for zvol, any append to
the ZIL chain will activate the feature for the dataset. Likewise
for starting a new ZIL chain.
4. A dataset that doesn't have a ZIL chain has the feature deactivated.
We ensure (3) by activating on the first zil_commit() after the feature
was enabled. Since activating the features requires waiting for txg
sync, the first zil_commit() after enabling the feature will be slower
than usual. The downside is that this is really a conservative
approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL
chain, we pay the penalty for feature activation. The upside is that the
user is in control of when we pay the penalty, i.e., upon enabling the
feature.
We ensure (4) by hooking into zil_sync(), where ZIL destroy actually
happens.
One more piece on feature activation, since it's spread across
multiple functions:
zil_commit()
zil_process_commit_list()
if lwb == NULL // first zil_commit since zil_open
zil_create()
if no log block pointer in ZIL header:
if feature enabled and not active:
// CASE 1
enable, COALESCE txg wait with dmu_tx that allocated the
log block
else // log block was allocated earlier than this zil_open
if feature enabled and not active:
// CASE 2
enable, EXPLICIT txg wait
else // already have an in-DRAM LWB
if feature enabled and not active:
// this happens when we enable the feature after zil_create
// CASE 3
enable, EXPLICIT txg wait
[1] https://github.com/illumos/illumos-gate/commit/da6c28aaf62fa55f0fdb8004aa40f88f23bf53f0
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #8768
Closes #9078
2022-02-23 00:06:43 +03:00
|
|
|
NULL, /* TX_SETSAXATTR */
|
2019-06-22 03:35:11 +03:00
|
|
|
NULL, /* TX_RENAME_EXCHANGE */
|
|
|
|
NULL, /* TX_RENAME_WHITEOUT */
|
2010-05-29 00:45:14 +04:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ZIL get_data callbacks
|
|
|
|
*/
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_get_done(zgd_t *zgd, int error)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) error;
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_ds_t *zd = zgd->zgd_private;
|
|
|
|
uint64_t object = ((rl_t *)zgd->zgd_lr)->rl_object;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (zgd->zgd_db)
|
|
|
|
dmu_buf_rele(zgd->zgd_db, zgd);
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock((rl_t *)zgd->zgd_lr);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
|
|
|
|
umem_free(zgd, sizeof (*zgd));
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2021-03-20 08:53:31 +03:00
|
|
|
ztest_get_data(void *arg, uint64_t arg2, lr_write_t *lr, char *buf,
|
|
|
|
struct lwb *lwb, zio_t *zio)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) arg2;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_ds_t *zd = arg;
|
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
uint64_t object = lr->lr_foid;
|
|
|
|
uint64_t offset = lr->lr_offset;
|
|
|
|
uint64_t size = lr->lr_length;
|
|
|
|
uint64_t txg = lr->lr_common.lrc_txg;
|
|
|
|
uint64_t crtxg;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
zgd_t *zgd;
|
|
|
|
int error;
|
|
|
|
|
OpenZFS 8585 - improve batching done in zil_commit()
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
Problem
=======
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
To further complicate the situation, these two issues result in the
following side effect:
3. If a given batch takes longer to complete than normal, this results
in larger batch sizes, which then take longer to complete and
further drive up the latency of zil_commit(). This can occur for a
number of reasons, including (but not limited to): transient changes
in the workload, and storage latency irregularites.
Solution
========
The solution attempted by this change has the following goals:
1. no on-disk changes; maintain current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk, while there's
already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible, for the same amount of ZIL
transactions, without introducing significant latency to any
individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.
In theory, with these goals met, the new allgorithm will allow the
following improvements:
1. new ZIL blocks can be generated and issued, while there's already
oustanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying
storage latency, rather than the incoming synchronous workload.
Porting Notes
=============
Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs than in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).
To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:
* A list of itxs was added to the lwb structure. This list contains
all of the itxs that have been committed to the lwb, such that the
callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
after the data for the itxs is committed to disk.
* A list of itxs was added on the stack of the zil_process_commit_list()
function; the "nolwb_itxs" list. In some circumstances, an itx may
not be committed to an lwb (e.g. if allocating the "next" ZIL block
on disk fails), so this list is used to keep track of which itxs
fall into this state, such that their callbacks can be called after
the ZIL's writer pipeline is "stalled".
* The logic to actually call the itx's callback was moved into the
zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
were effectively performing the same logic (i.e. if callback is
non-null, call the callback), it seemed like useful code cleanup to
consolidate this logic into a single function.
Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:
* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
removed from "trace_zil.h" as well.
* Some of the zilog structure's fields were removed, which affected
the tracepoint definitions of the structure.
* New tracepoints had to be added for the following 3 new probes:
* zil__process__commit__itx
* zil__process__normal__itx
* zil__commit__io__error
OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566
2017-12-05 20:39:16 +03:00
|
|
|
ASSERT3P(lwb, !=, NULL);
|
|
|
|
ASSERT3U(size, !=, 0);
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_object_lock(zd, object, ZTRL_READER);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_bonus_hold(os, object, FTAG, &db);
|
|
|
|
if (error) {
|
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
crtxg = ztest_bt_bonus(db)->bt_crtxg;
|
|
|
|
|
|
|
|
if (crtxg == 0 || crtxg > txg) {
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
return (ENOENT);
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
db = NULL;
|
|
|
|
|
|
|
|
zgd = umem_zalloc(sizeof (*zgd), UMEM_NOFAIL);
|
OpenZFS 8585 - improve batching done in zil_commit()
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
Problem
=======
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
To further complicate the situation, these two issues result in the
following side effect:
3. If a given batch takes longer to complete than normal, this results
in larger batch sizes, which then take longer to complete and
further drive up the latency of zil_commit(). This can occur for a
number of reasons, including (but not limited to): transient changes
in the workload, and storage latency irregularites.
Solution
========
The solution attempted by this change has the following goals:
1. no on-disk changes; maintain current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk, while there's
already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible, for the same amount of ZIL
transactions, without introducing significant latency to any
individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.
In theory, with these goals met, the new allgorithm will allow the
following improvements:
1. new ZIL blocks can be generated and issued, while there's already
oustanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying
storage latency, rather than the incoming synchronous workload.
Porting Notes
=============
Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs than in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).
To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:
* A list of itxs was added to the lwb structure. This list contains
all of the itxs that have been committed to the lwb, such that the
callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
after the data for the itxs is committed to disk.
* A list of itxs was added on the stack of the zil_process_commit_list()
function; the "nolwb_itxs" list. In some circumstances, an itx may
not be committed to an lwb (e.g. if allocating the "next" ZIL block
on disk fails), so this list is used to keep track of which itxs
fall into this state, such that their callbacks can be called after
the ZIL's writer pipeline is "stalled".
* The logic to actually call the itx's callback was moved into the
zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
were effectively performing the same logic (i.e. if callback is
non-null, call the callback), it seemed like useful code cleanup to
consolidate this logic into a single function.
Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:
* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
removed from "trace_zil.h" as well.
* Some of the zilog structure's fields were removed, which affected
the tracepoint definitions of the structure.
* New tracepoints had to be added for the following 3 new probes:
* zil__process__commit__itx
* zil__process__normal__itx
* zil__commit__io__error
OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566
2017-12-05 20:39:16 +03:00
|
|
|
zgd->zgd_lwb = lwb;
|
2018-10-02 01:13:12 +03:00
|
|
|
zgd->zgd_private = zd;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (buf != NULL) { /* immediate write */
|
2019-11-01 20:37:33 +03:00
|
|
|
zgd->zgd_lr = (struct zfs_locked_range *)ztest_range_lock(zd,
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
object, offset, size, ZTRL_READER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
error = dmu_read(os, object, offset, size, buf,
|
|
|
|
DMU_READ_NO_PREFETCH);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
ZIL: Second attempt to reduce scope of zl_issuer_lock.
The previous patch #14841 appeared to have significant flaw, causing
deadlocks if zl_get_data callback got blocked waiting for TXG sync. I
already handled some of such cases in the original patch, but issue
#14982 shown cases that were impossible to solve in that design.
This patch fixes the problem by postponing log blocks allocation till
the very end, just before the zios issue, leaving nothing blocking after
that point to cause deadlocks. Before that point though any sleeps are
now allowed, not causing sync thread blockage. This require slightly
more complicated lwb state machine to allocate blocks and issue zios
in proper order. But with removal of special early issue workarounds
the new code is much cleaner now, and should even be more efficient.
Since this patch uses null zios between write, I've found that null
zios do not wait for logical children ready status in zio_ready(),
that makes parent write to proceed prematurely, producing incorrect
log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children()
fixes it.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15122
2023-08-25 03:08:49 +03:00
|
|
|
ASSERT3P(zio, !=, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
size = doi.doi_data_block_size;
|
|
|
|
if (ISP2(size)) {
|
2024-05-10 18:47:21 +03:00
|
|
|
offset = P2ALIGN_TYPED(offset, size, uint64_t);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(offset, <, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
offset = 0;
|
|
|
|
}
|
|
|
|
|
2019-11-01 20:37:33 +03:00
|
|
|
zgd->zgd_lr = (struct zfs_locked_range *)ztest_range_lock(zd,
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
object, offset, size, ZTRL_READER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2023-09-22 04:40:13 +03:00
|
|
|
error = dmu_buf_hold_noread(os, object, offset, zgd, &db);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (error == 0) {
|
2017-04-14 22:59:18 +03:00
|
|
|
blkptr_t *bp = &lr->lr_blkptr;
|
2013-05-10 23:47:54 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
zgd->zgd_db = db;
|
|
|
|
zgd->zgd_bp = bp;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(db->db_offset, ==, offset);
|
|
|
|
ASSERT3U(db->db_size, ==, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
error = dmu_sync(zio, lr->lr_common.lrc_txg,
|
|
|
|
ztest_get_done, zgd);
|
|
|
|
|
|
|
|
if (error == 0)
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
ztest_get_done(zgd, error);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *
|
|
|
|
ztest_lr_alloc(size_t lrsize, char *name)
|
|
|
|
{
|
|
|
|
char *lr;
|
|
|
|
size_t namesize = name ? strlen(name) + 1 : 0;
|
|
|
|
|
|
|
|
lr = umem_zalloc(lrsize + namesize, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
if (name)
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(lr + lrsize, name, namesize);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
return (lr);
|
|
|
|
}
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_lr_free(void *lr, size_t lrsize, char *name)
|
|
|
|
{
|
|
|
|
size_t namesize = name ? strlen(name) + 1 : 0;
|
|
|
|
|
|
|
|
umem_free(lr, lrsize + namesize);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Lookup a bunch of objects. Returns the number of objects not found.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
ztest_lookup(ztest_ds_t *zd, ztest_od_t *od, int count)
|
|
|
|
{
|
|
|
|
int missing = 0;
|
|
|
|
int error;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
ASSERT(MUTEX_HELD(&zd->zd_dirobj_lock));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < count; i++, od++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
od->od_object = 0;
|
|
|
|
error = zap_lookup(zd->zd_os, od->od_dir, od->od_name,
|
|
|
|
sizeof (uint64_t), 1, &od->od_object);
|
|
|
|
if (error) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(error, ==, ENOENT);
|
|
|
|
ASSERT0(od->od_object);
|
2010-05-29 00:45:14 +04:00
|
|
|
missing++;
|
|
|
|
} else {
|
|
|
|
dmu_buf_t *db;
|
|
|
|
ztest_block_tag_t *bbt;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(od->od_object, !=, 0);
|
|
|
|
ASSERT0(missing); /* there should be no gaps */
|
2010-05-29 00:45:14 +04:00
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_object_lock(zd, od->od_object, ZTRL_READER);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_bonus_hold(zd->zd_os, od->od_object,
|
|
|
|
FTAG, &db));
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
bbt = ztest_bt_bonus(db);
|
|
|
|
ASSERT3U(bbt->bt_magic, ==, BT_MAGIC);
|
|
|
|
od->od_type = doi.doi_type;
|
|
|
|
od->od_blocksize = doi.doi_data_block_size;
|
|
|
|
od->od_gen = bbt->bt_gen;
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
ztest_object_unlock(zd, od->od_object);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return (missing);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
ztest_create(ztest_ds_t *zd, ztest_od_t *od, int count)
|
|
|
|
{
|
|
|
|
int missing = 0;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
ASSERT(MUTEX_HELD(&zd->zd_dirobj_lock));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < count; i++, od++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (missing) {
|
|
|
|
od->od_object = 0;
|
|
|
|
missing++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2024-09-27 19:18:11 +03:00
|
|
|
lr_create_t *lrc = ztest_lr_alloc(sizeof (*lrc), od->od_name);
|
|
|
|
_lr_create_t *lr = &lrc->lr_create;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
lr->lr_doid = od->od_dir;
|
|
|
|
lr->lr_foid = 0; /* 0 to allocate, > 0 to claim */
|
|
|
|
lr->lrz_type = od->od_crtype;
|
|
|
|
lr->lrz_blocksize = od->od_crblocksize;
|
|
|
|
lr->lrz_ibshift = ztest_random_ibshift();
|
|
|
|
lr->lrz_bonustype = DMU_OT_UINT64_OTHER;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
lr->lrz_dnodesize = od->od_crdnodesize;
|
2010-05-29 00:45:14 +04:00
|
|
|
lr->lr_gen = od->od_crgen;
|
|
|
|
lr->lr_crtime[0] = time(NULL);
|
|
|
|
|
|
|
|
if (ztest_replay_create(zd, lr, B_FALSE) != 0) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT0(missing);
|
2010-05-29 00:45:14 +04:00
|
|
|
od->od_object = 0;
|
|
|
|
missing++;
|
|
|
|
} else {
|
|
|
|
od->od_object = lr->lr_foid;
|
|
|
|
od->od_type = od->od_crtype;
|
|
|
|
od->od_blocksize = od->od_crblocksize;
|
|
|
|
od->od_gen = od->od_crgen;
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(od->od_object, !=, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
ztest_lr_free(lr, sizeof (*lr), od->od_name);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (missing);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
ztest_remove(ztest_ds_t *zd, ztest_od_t *od, int count)
|
|
|
|
{
|
|
|
|
int missing = 0;
|
|
|
|
int error;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
ASSERT(MUTEX_HELD(&zd->zd_dirobj_lock));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
od += count - 1;
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = count - 1; i >= 0; i--, od--) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (missing) {
|
|
|
|
missing++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2013-05-10 23:47:54 +04:00
|
|
|
/*
|
|
|
|
* No object was found.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (od->od_object == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
lr_remove_t *lr = ztest_lr_alloc(sizeof (*lr), od->od_name);
|
|
|
|
|
|
|
|
lr->lr_doid = od->od_dir;
|
|
|
|
|
|
|
|
if ((error = ztest_replay_remove(zd, lr, B_FALSE)) != 0) {
|
|
|
|
ASSERT3U(error, ==, ENOSPC);
|
|
|
|
missing++;
|
|
|
|
} else {
|
|
|
|
od->od_object = 0;
|
|
|
|
}
|
|
|
|
ztest_lr_free(lr, sizeof (*lr), od->od_name);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (missing);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
ztest_write(ztest_ds_t *zd, uint64_t object, uint64_t offset, uint64_t size,
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
const void *data)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
lr_write_t *lr;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
lr = ztest_lr_alloc(sizeof (*lr) + size, NULL);
|
|
|
|
|
|
|
|
lr->lr_foid = object;
|
|
|
|
lr->lr_offset = offset;
|
|
|
|
lr->lr_length = size;
|
|
|
|
lr->lr_blkoff = 0;
|
|
|
|
BP_ZERO(&lr->lr_blkptr);
|
|
|
|
|
2024-09-27 19:18:11 +03:00
|
|
|
memcpy(&lr->lr_data[0], data, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
error = ztest_replay_write(zd, lr, B_FALSE);
|
|
|
|
|
|
|
|
ztest_lr_free(lr, sizeof (*lr) + size, NULL);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
ztest_truncate(ztest_ds_t *zd, uint64_t object, uint64_t offset, uint64_t size)
|
|
|
|
{
|
|
|
|
lr_truncate_t *lr;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
lr = ztest_lr_alloc(sizeof (*lr), NULL);
|
|
|
|
|
|
|
|
lr->lr_foid = object;
|
|
|
|
lr->lr_offset = offset;
|
|
|
|
lr->lr_length = size;
|
|
|
|
|
|
|
|
error = ztest_replay_truncate(zd, lr, B_FALSE);
|
|
|
|
|
|
|
|
ztest_lr_free(lr, sizeof (*lr), NULL);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
ztest_setattr(ztest_ds_t *zd, uint64_t object)
|
|
|
|
{
|
|
|
|
lr_setattr_t *lr;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
lr = ztest_lr_alloc(sizeof (*lr), NULL);
|
|
|
|
|
|
|
|
lr->lr_foid = object;
|
|
|
|
lr->lr_size = 0;
|
|
|
|
lr->lr_mode = 0;
|
|
|
|
|
|
|
|
error = ztest_replay_setattr(zd, lr, B_FALSE);
|
|
|
|
|
|
|
|
ztest_lr_free(lr, sizeof (*lr), NULL);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_prealloc(ztest_ds_t *zd, uint64_t object, uint64_t offset, uint64_t size)
|
|
|
|
{
|
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
uint64_t txg;
|
2018-10-02 01:13:12 +03:00
|
|
|
rl_t *rl;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
txg_wait_synced(dmu_objset_pool(os), 0);
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_object_lock(zd, object, ZTRL_READER);
|
|
|
|
rl = ztest_range_lock(zd, object, offset, size, ZTRL_WRITER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
dmu_tx_hold_write(tx, object, offset, size);
|
|
|
|
|
|
|
|
txg = ztest_tx_assign(tx, TXG_WAIT, FTAG);
|
|
|
|
|
|
|
|
if (txg != 0) {
|
|
|
|
dmu_prealloc(os, object, offset, size, tx);
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
txg_wait_synced(dmu_objset_pool(os), txg);
|
|
|
|
} else {
|
|
|
|
(void) dmu_free_long_range(os, object, offset, size);
|
|
|
|
}
|
|
|
|
|
2018-10-02 01:13:12 +03:00
|
|
|
ztest_range_unlock(rl);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_object_unlock(zd, object);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_io(ztest_ds_t *zd, uint64_t object, uint64_t offset)
|
|
|
|
{
|
2013-05-10 23:47:54 +04:00
|
|
|
int err;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_block_tag_t wbt;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
enum ztest_io_type io_type;
|
|
|
|
uint64_t blocksize;
|
|
|
|
void *data;
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
uint32_t dmu_read_flags = DMU_READ_NO_PREFETCH;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We will randomly set when to do O_DIRECT on a read.
|
|
|
|
*/
|
|
|
|
if (ztest_random(4) == 0)
|
|
|
|
dmu_read_flags |= DMU_DIRECTIO;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_object_info(zd->zd_os, object, &doi));
|
2010-05-29 00:45:14 +04:00
|
|
|
blocksize = doi.doi_data_block_size;
|
|
|
|
data = umem_alloc(blocksize, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick an i/o type at random, biased toward writing block tags.
|
|
|
|
*/
|
|
|
|
io_type = ztest_random(ZTEST_IO_TYPES);
|
|
|
|
if (ztest_random(2) == 0)
|
|
|
|
io_type = ZTEST_IO_WRITE_TAG;
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&zd->zd_zilog_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
switch (io_type) {
|
|
|
|
|
|
|
|
case ZTEST_IO_WRITE_TAG:
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_bt_generate(&wbt, zd->zd_os, object, doi.doi_dnodesize,
|
|
|
|
offset, 0, 0, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) ztest_write(zd, object, offset, sizeof (wbt), &wbt);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ZTEST_IO_WRITE_PATTERN:
|
|
|
|
(void) memset(data, 'a' + (object + offset) % 5, blocksize);
|
|
|
|
if (ztest_random(2) == 0) {
|
|
|
|
/*
|
|
|
|
* Induce fletcher2 collisions to ensure that
|
|
|
|
* zio_ddt_collision() detects and resolves them
|
|
|
|
* when using fletcher2-verify for deduplication.
|
|
|
|
*/
|
|
|
|
((uint64_t *)data)[0] ^= 1ULL << 63;
|
|
|
|
((uint64_t *)data)[4] ^= 1ULL << 63;
|
|
|
|
}
|
|
|
|
(void) ztest_write(zd, object, offset, blocksize, data);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ZTEST_IO_WRITE_ZEROES:
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(data, 0, blocksize);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) ztest_write(zd, object, offset, blocksize, data);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ZTEST_IO_TRUNCATE:
|
|
|
|
(void) ztest_truncate(zd, object, offset, blocksize);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ZTEST_IO_SETATTR:
|
|
|
|
(void) ztest_setattr(zd, object);
|
|
|
|
break;
|
2010-08-26 20:52:41 +04:00
|
|
|
default:
|
|
|
|
break;
|
2013-05-10 23:47:54 +04:00
|
|
|
|
|
|
|
case ZTEST_IO_REWRITE:
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2013-05-10 23:47:54 +04:00
|
|
|
err = ztest_dsl_prop_set_uint64(zd->zd_name,
|
|
|
|
ZFS_PROP_CHECKSUM, spa_dedup_checksum(ztest_spa),
|
|
|
|
B_FALSE);
|
2023-01-05 03:23:28 +03:00
|
|
|
ASSERT(err == 0 || err == ENOSPC);
|
2013-05-10 23:47:54 +04:00
|
|
|
err = ztest_dsl_prop_set_uint64(zd->zd_name,
|
|
|
|
ZFS_PROP_COMPRESSION,
|
|
|
|
ztest_random_dsl_prop(ZFS_PROP_COMPRESSION),
|
|
|
|
B_FALSE);
|
2023-01-05 03:23:28 +03:00
|
|
|
ASSERT(err == 0 || err == ENOSPC);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2013-05-10 23:47:54 +04:00
|
|
|
|
|
|
|
VERIFY0(dmu_read(zd->zd_os, object, offset, blocksize, data,
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
dmu_read_flags));
|
2013-05-10 23:47:54 +04:00
|
|
|
|
|
|
|
(void) ztest_write(zd, object, offset, blocksize, data);
|
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&zd->zd_zilog_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
umem_free(data, blocksize);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize an object description template.
|
|
|
|
*/
|
|
|
|
static void
|
2022-04-19 21:49:30 +03:00
|
|
|
ztest_od_init(ztest_od_t *od, uint64_t id, const char *tag, uint64_t index,
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
dmu_object_type_t type, uint64_t blocksize, uint64_t dnodesize,
|
|
|
|
uint64_t gen)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
od->od_dir = ZTEST_DIROBJ;
|
|
|
|
od->od_object = 0;
|
|
|
|
|
|
|
|
od->od_crtype = type;
|
|
|
|
od->od_crblocksize = blocksize ? blocksize : ztest_random_blocksize();
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
od->od_crdnodesize = dnodesize ? dnodesize : ztest_random_dnodesize();
|
2010-05-29 00:45:14 +04:00
|
|
|
od->od_crgen = gen;
|
|
|
|
|
|
|
|
od->od_type = DMU_OT_NONE;
|
|
|
|
od->od_blocksize = 0;
|
|
|
|
od->od_gen = 0;
|
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) snprintf(od->od_name, sizeof (od->od_name),
|
|
|
|
"%s(%"PRId64")[%"PRIu64"]",
|
|
|
|
tag, id, index);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Lookup or create the objects for a test using the od template.
|
|
|
|
* If the objects do not all exist, or if 'remove' is specified,
|
|
|
|
* remove any existing objects and create new ones. Otherwise,
|
|
|
|
* use the existing objects.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
ztest_object_init(ztest_ds_t *zd, ztest_od_t *od, size_t size, boolean_t remove)
|
|
|
|
{
|
|
|
|
int count = size / sizeof (*od);
|
|
|
|
int rv = 0;
|
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_enter(&zd->zd_dirobj_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((ztest_lookup(zd, od, count) != 0 || remove) &&
|
|
|
|
(ztest_remove(zd, od, count) != 0 ||
|
|
|
|
ztest_create(zd, od, count) != 0))
|
|
|
|
rv = -1;
|
|
|
|
zd->zd_od = od;
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_exit(&zd->zd_dirobj_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
return (rv);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
ztest_zil_commit(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) id;
|
2010-05-29 00:45:14 +04:00
|
|
|
zilog_t *zilog = zd->zd_zilog;
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&zd->zd_zilog_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
zil_commit(zilog, ztest_random(ZTEST_OBJECTS));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Remember the committed values in zd, which is in parent/child
|
|
|
|
* shared memory. If we die, the next iteration of ztest_run()
|
|
|
|
* will verify that the log really does contain this record.
|
|
|
|
*/
|
|
|
|
mutex_enter(&zilog->zl_lock);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(zd->zd_shared, !=, NULL);
|
2012-01-24 06:43:32 +04:00
|
|
|
ASSERT3U(zd->zd_shared->zd_seq, <=, zilog->zl_commit_lr_seq);
|
|
|
|
zd->zd_shared->zd_seq = zilog->zl_commit_lr_seq;
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&zilog->zl_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&zd->zd_zilog_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function is designed to simulate the operations that occur during a
|
|
|
|
* mount/unmount operation. We hold the dataset across these operations in an
|
|
|
|
* attempt to expose any implicit assumptions about ZIL management.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_zil_remount(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) id;
|
2011-07-26 23:41:53 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
|
2018-12-04 20:43:31 +03:00
|
|
|
/*
|
|
|
|
* We hold the ztest_vdev_lock so we don't cause problems with
|
|
|
|
* other threads that wish to remove a log device, such as
|
|
|
|
* ztest_device_removal().
|
|
|
|
*/
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
2013-05-10 23:47:54 +04:00
|
|
|
/*
|
|
|
|
* We grab the zd_dirobj_lock to ensure that no other thread is
|
|
|
|
* updating the zil (i.e. adding in-memory log records) and the
|
|
|
|
* zd_zilog_lock to block any I/O.
|
|
|
|
*/
|
2012-12-15 04:13:40 +04:00
|
|
|
mutex_enter(&zd->zd_dirobj_lock);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_wrlock(&zd->zd_zilog_lock);
|
2011-07-26 23:41:53 +04:00
|
|
|
|
2017-03-09 01:56:19 +03:00
|
|
|
/* zfsvfs_teardown() */
|
2011-07-26 23:41:53 +04:00
|
|
|
zil_close(zd->zd_zilog);
|
|
|
|
|
|
|
|
/* zfsvfs_setup() */
|
2022-07-21 03:14:06 +03:00
|
|
|
VERIFY3P(zil_open(os, ztest_get_data, NULL), ==, zd->zd_zilog);
|
2011-07-26 23:41:53 +04:00
|
|
|
zil_replay(os, zd, ztest_replay_vector);
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&zd->zd_zilog_lock);
|
2012-12-15 04:13:40 +04:00
|
|
|
mutex_exit(&zd->zd_dirobj_lock);
|
2018-12-04 20:43:31 +03:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can't destroy an active pool, create an existing pool,
|
|
|
|
* or create a pool with a bad vdev spec.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_spa_create_destroy(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_opts_t *zo = &ztest_opts;
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_t *spa;
|
|
|
|
nvlist_t *nvroot;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (zo->zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Attempt to create using a bad file.
|
|
|
|
*/
|
2018-09-06 04:33:36 +03:00
|
|
|
nvroot = make_vdev_root("/dev/bogus", NULL, NULL, 0, 0, NULL, 0, 0, 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(ENOENT, ==,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
spa_create("ztest_bad_file", nvroot, NULL, NULL, NULL));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Attempt to create using a bad mirror.
|
|
|
|
*/
|
2018-09-06 04:33:36 +03:00
|
|
|
nvroot = make_vdev_root("/dev/bogus", NULL, NULL, 0, 0, NULL, 0, 2, 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(ENOENT, ==,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
spa_create("ztest_bad_mirror", nvroot, NULL, NULL, NULL));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Attempt to create an existing pool. It shouldn't matter
|
|
|
|
* what's in the nvroot; we should fail with EEXIST.
|
|
|
|
*/
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2018-09-06 04:33:36 +03:00
|
|
|
nvroot = make_vdev_root("/dev/bogus", NULL, NULL, 0, 0, NULL, 0, 0, 1);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
VERIFY3U(EEXIST, ==,
|
|
|
|
spa_create(zo->zo_pool, nvroot, NULL, NULL, NULL));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2019-07-18 23:02:33 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We open a reference to the spa and then we try to export it
|
|
|
|
* expecting one of the following errors:
|
|
|
|
*
|
|
|
|
* EBUSY
|
|
|
|
* Because of the reference we just opened.
|
|
|
|
*
|
|
|
|
* ZFS_ERR_EXPORT_IN_PROGRESS
|
|
|
|
* For the case that there is another ztest thread doing
|
|
|
|
* an export concurrently.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(zo->zo_pool, &spa, FTAG));
|
2019-07-18 23:02:33 +03:00
|
|
|
int error = spa_destroy(zo->zo_pool);
|
|
|
|
if (error != EBUSY && error != ZFS_ERR_EXPORT_IN_PROGRESS) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "spa_destroy(%s) returned unexpected value %d",
|
2019-07-18 23:02:33 +03:00
|
|
|
spa->spa_name, error);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2017-07-21 03:54:26 +03:00
|
|
|
/*
|
|
|
|
* Start and then stop the MMP threads to ensure the startup and shutdown code
|
|
|
|
* works properly. Actual protection and property-related code tested via ZTS.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_mmp_enable_disable(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2017-07-21 03:54:26 +03:00
|
|
|
ztest_shared_opts_t *zo = &ztest_opts;
|
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
|
|
|
|
if (zo->zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2019-03-01 04:56:19 +03:00
|
|
|
/*
|
|
|
|
* Since enabling MMP involves setting a property, it could not be done
|
|
|
|
* while the pool is suspended.
|
|
|
|
*/
|
|
|
|
if (spa_suspended(spa))
|
|
|
|
return;
|
|
|
|
|
2017-07-21 03:54:26 +03:00
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
mutex_enter(&spa->spa_props_lock);
|
|
|
|
|
2018-08-13 18:12:25 +03:00
|
|
|
zfs_multihost_fail_intervals = 0;
|
|
|
|
|
2017-07-21 03:54:26 +03:00
|
|
|
if (!spa_multihost(spa)) {
|
|
|
|
spa->spa_multihost = B_TRUE;
|
|
|
|
mmp_thread_start(spa);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&spa->spa_props_lock);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
mmp_signal_all_threads();
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
mutex_enter(&spa->spa_props_lock);
|
|
|
|
|
|
|
|
if (spa_multihost(spa)) {
|
|
|
|
mmp_thread_stop(spa);
|
|
|
|
spa->spa_multihost = B_FALSE;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&spa->spa_props_lock);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
}
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
static int
|
|
|
|
ztest_get_raidz_children(spa_t *spa)
|
|
|
|
{
|
|
|
|
(void) spa;
|
|
|
|
vdev_t *raidvd;
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&ztest_vdev_lock));
|
|
|
|
|
|
|
|
if (ztest_opts.zo_raid_do_expand) {
|
|
|
|
raidvd = ztest_spa->spa_root_vdev->vdev_child[0];
|
|
|
|
|
|
|
|
ASSERT(raidvd->vdev_ops == &vdev_raidz_ops);
|
|
|
|
|
|
|
|
return (raidvd->vdev_children);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (ztest_opts.zo_raid_children);
|
|
|
|
}
|
|
|
|
|
2012-12-15 04:28:49 +04:00
|
|
|
void
|
|
|
|
ztest_spa_upgrade(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2012-12-15 04:28:49 +04:00
|
|
|
spa_t *spa;
|
|
|
|
uint64_t initial_version = SPA_VERSION_INITIAL;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
uint64_t raidz_children, version, newversion;
|
2012-12-15 04:28:49 +04:00
|
|
|
nvlist_t *nvroot, *props;
|
|
|
|
char *name;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
/* dRAID added after feature flags, skip upgrade test. */
|
|
|
|
if (strcmp(ztest_opts.zo_raid_type, VDEV_TYPE_DRAID) == 0)
|
|
|
|
return;
|
|
|
|
|
2012-12-15 04:28:49 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
name = kmem_asprintf("%s_upgrade", ztest_opts.zo_pool);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clean up from previous runs.
|
|
|
|
*/
|
|
|
|
(void) spa_destroy(name);
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_children = ztest_get_raidz_children(ztest_spa);
|
|
|
|
|
2012-12-15 04:28:49 +04:00
|
|
|
nvroot = make_vdev_root(NULL, NULL, name, ztest_opts.zo_vdev_size, 0,
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
NULL, raidz_children, ztest_opts.zo_mirrors, 1);
|
2012-12-15 04:28:49 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're configuring a RAIDZ device then make sure that the
|
2019-01-13 21:11:52 +03:00
|
|
|
* initial version is capable of supporting that feature.
|
2012-12-15 04:28:49 +04:00
|
|
|
*/
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
switch (ztest_opts.zo_raid_parity) {
|
2012-12-15 04:28:49 +04:00
|
|
|
case 0:
|
|
|
|
case 1:
|
|
|
|
initial_version = SPA_VERSION_INITIAL;
|
|
|
|
break;
|
|
|
|
case 2:
|
|
|
|
initial_version = SPA_VERSION_RAIDZ2;
|
|
|
|
break;
|
|
|
|
case 3:
|
|
|
|
initial_version = SPA_VERSION_RAIDZ3;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create a pool with a spa version that can be upgraded. Pick
|
|
|
|
* a value between initial_version and SPA_VERSION_BEFORE_FEATURES.
|
|
|
|
*/
|
|
|
|
do {
|
|
|
|
version = ztest_random_spa_version(initial_version);
|
|
|
|
} while (version > SPA_VERSION_BEFORE_FEATURES);
|
|
|
|
|
|
|
|
props = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_VERSION), version);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_create(name, nvroot, props, NULL, NULL));
|
2012-12-15 04:28:49 +04:00
|
|
|
fnvlist_free(nvroot);
|
|
|
|
fnvlist_free(props);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(name, &spa, FTAG));
|
2012-12-15 04:28:49 +04:00
|
|
|
VERIFY3U(spa_version(spa), ==, version);
|
|
|
|
newversion = ztest_random_spa_version(version + 1);
|
|
|
|
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) printf("upgrading spa version from "
|
|
|
|
"%"PRIu64" to %"PRIu64"\n",
|
|
|
|
version, newversion);
|
2012-12-15 04:28:49 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
spa_upgrade(spa, newversion);
|
|
|
|
VERIFY3U(spa_version(spa), >, version);
|
|
|
|
VERIFY3U(spa_version(spa), ==, fnvlist_lookup_uint64(spa->spa_config,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_VERSION)));
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
2019-10-10 19:47:06 +03:00
|
|
|
kmem_strfree(name);
|
2012-12-15 04:28:49 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
}
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
static void
|
|
|
|
ztest_spa_checkpoint(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&ztest_checkpoint_lock));
|
|
|
|
|
|
|
|
int error = spa_checkpoint(spa->spa_name);
|
|
|
|
|
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
case ZFS_ERR_DEVRM_IN_PROGRESS:
|
|
|
|
case ZFS_ERR_DISCARDING_CHECKPOINT:
|
|
|
|
case ZFS_ERR_CHECKPOINT_EXISTS:
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
case ZFS_ERR_RAIDZ_EXPAND_IN_PROGRESS:
|
2016-12-17 01:11:29 +03:00
|
|
|
break;
|
|
|
|
case ENOSPC:
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
break;
|
|
|
|
default:
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "spa_checkpoint(%s) = %d", spa->spa_name, error);
|
2016-12-17 01:11:29 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_spa_discard_checkpoint(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&ztest_checkpoint_lock));
|
|
|
|
|
|
|
|
int error = spa_checkpoint_discard(spa->spa_name);
|
|
|
|
|
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
case ZFS_ERR_DISCARDING_CHECKPOINT:
|
|
|
|
case ZFS_ERR_NO_CHECKPOINT:
|
|
|
|
break;
|
|
|
|
default:
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "spa_discard_checkpoint(%s) = %d",
|
2016-12-17 01:11:29 +03:00
|
|
|
spa->spa_name, error);
|
|
|
|
}
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
ztest_spa_checkpoint_create_discard(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2016-12-17 01:11:29 +03:00
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
|
|
|
|
mutex_enter(&ztest_checkpoint_lock);
|
|
|
|
if (ztest_random(2) == 0) {
|
|
|
|
ztest_spa_checkpoint(spa);
|
|
|
|
} else {
|
|
|
|
ztest_spa_discard_checkpoint(spa);
|
|
|
|
}
|
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static vdev_t *
|
|
|
|
vdev_lookup_by_path(vdev_t *vd, const char *path)
|
|
|
|
{
|
|
|
|
vdev_t *mvd;
|
2010-08-26 20:52:39 +04:00
|
|
|
int c;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (vd->vdev_path != NULL && strcmp(path, vd->vdev_path) == 0)
|
|
|
|
return (vd);
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (c = 0; c < vd->vdev_children; c++)
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((mvd = vdev_lookup_by_path(vd->vdev_child[c], path)) !=
|
|
|
|
NULL)
|
|
|
|
return (mvd);
|
|
|
|
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
static int
|
|
|
|
spa_num_top_vdevs(spa_t *spa)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
ASSERT3U(spa_config_held(spa, SCL_VDEV, RW_READER), ==, SCL_VDEV);
|
|
|
|
return (rvd->vdev_children);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that vdev_add() works as expected.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_vdev_add_remove(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t leaves;
|
|
|
|
uint64_t guid;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
uint64_t raidz_children;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *nvroot;
|
|
|
|
int error;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_children = ztest_get_raidz_children(spa);
|
|
|
|
leaves = MAX(zs->zs_mirrors + zs->zs_splits, 1) * raidz_children;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
ztest_shared->zs_vdev_next_leaf = spa_num_top_vdevs(spa) * leaves;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have slogs then remove them 1/4 of the time.
|
|
|
|
*/
|
|
|
|
if (spa_has_slogs(spa) && ztest_random(4) == 0) {
|
2018-09-06 04:33:36 +03:00
|
|
|
metaslab_group_t *mg;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2018-09-06 04:33:36 +03:00
|
|
|
* find the first real slog in log allocation class
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2020-12-15 21:55:44 +03:00
|
|
|
mg = spa_log_class(spa)->mc_allocator[0].mca_rotor;
|
2018-09-06 04:33:36 +03:00
|
|
|
while (!mg->mg_vd->vdev_islog)
|
|
|
|
mg = mg->mg_next;
|
|
|
|
|
|
|
|
guid = mg->mg_vd->vdev_guid;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We have to grab the zs_name_lock as writer to
|
|
|
|
* prevent a race between removing a slog (dmu_objset_find)
|
|
|
|
* and destroying a dataset. Removing the slog will
|
|
|
|
* grab a reference on the dataset which may cause
|
2013-09-04 16:00:57 +04:00
|
|
|
* dsl_destroy_head() to fail with EBUSY thus
|
2010-05-29 00:45:14 +04:00
|
|
|
* leaving the dataset in an inconsistent state.
|
|
|
|
*/
|
2018-06-05 02:52:10 +03:00
|
|
|
pthread_rwlock_wrlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_vdev_remove(spa, guid, B_FALSE);
|
2018-06-05 02:52:10 +03:00
|
|
|
pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-06-14 19:41:27 +03:00
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
case EEXIST: /* Generic zil_reset() error */
|
|
|
|
case EBUSY: /* Replay required */
|
|
|
|
case EACCES: /* Crypto key not loaded */
|
2016-12-17 01:11:29 +03:00
|
|
|
case ZFS_ERR_CHECKPOINT_EXISTS:
|
|
|
|
case ZFS_ERR_DISCARDING_CHECKPOINT:
|
2018-06-14 19:41:27 +03:00
|
|
|
break;
|
|
|
|
default:
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "spa_vdev_remove() = %d", error);
|
2018-06-14 19:41:27 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
/*
|
2018-09-06 04:33:36 +03:00
|
|
|
* Make 1/4 of the devices be log devices
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2012-12-15 04:28:49 +04:00
|
|
|
nvroot = make_vdev_root(NULL, NULL, NULL,
|
2018-09-06 04:33:36 +03:00
|
|
|
ztest_opts.zo_vdev_size, 0, (ztest_random(4) == 0) ?
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
"log" : NULL, raidz_children, zs->zs_mirrors,
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
1);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2024-03-29 22:15:56 +03:00
|
|
|
error = spa_vdev_add(spa, nvroot, B_FALSE);
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
break;
|
|
|
|
case ENOSPC:
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc("spa_vdev_add");
|
2016-12-17 01:11:29 +03:00
|
|
|
break;
|
|
|
|
default:
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "spa_vdev_add() = %d", error);
|
2016-12-17 01:11:29 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
void
|
|
|
|
ztest_vdev_class_add(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2018-09-06 04:33:36 +03:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
uint64_t leaves;
|
|
|
|
nvlist_t *nvroot;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
uint64_t raidz_children;
|
2018-09-06 04:33:36 +03:00
|
|
|
const char *class = (ztest_random(2) == 0) ?
|
|
|
|
VDEV_ALLOC_BIAS_SPECIAL : VDEV_ALLOC_BIAS_DEDUP;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* By default add a special vdev 50% of the time
|
|
|
|
*/
|
|
|
|
if ((ztest_opts.zo_special_vdevs == ZTEST_VDEV_CLASS_OFF) ||
|
|
|
|
(ztest_opts.zo_special_vdevs == ZTEST_VDEV_CLASS_RND &&
|
|
|
|
ztest_random(2) == 0)) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
/* Only test with mirrors */
|
|
|
|
if (zs->zs_mirrors < 2) {
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* requires feature@allocation_classes */
|
|
|
|
if (!spa_feature_is_enabled(spa, SPA_FEATURE_ALLOCATION_CLASSES)) {
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_children = ztest_get_raidz_children(spa);
|
|
|
|
leaves = MAX(zs->zs_mirrors + zs->zs_splits, 1) * raidz_children;
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
ztest_shared->zs_vdev_next_leaf = spa_num_top_vdevs(spa) * leaves;
|
2018-09-06 04:33:36 +03:00
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
nvroot = make_vdev_root(NULL, NULL, NULL, ztest_opts.zo_vdev_size, 0,
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
class, raidz_children, zs->zs_mirrors, 1);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
2024-03-29 22:15:56 +03:00
|
|
|
error = spa_vdev_add(spa, nvroot, B_FALSE);
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
if (error == ENOSPC)
|
|
|
|
ztest_record_enospc("spa_vdev_add");
|
|
|
|
else if (error != 0)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "spa_vdev_add() = %d", error);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* 50% of the time allow small blocks in the special class
|
|
|
|
*/
|
|
|
|
if (error == 0 &&
|
|
|
|
spa_special_class(spa)->mc_groups == 1 && ztest_random(2) == 0) {
|
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
|
|
|
(void) printf("Enabling special VDEV small blocks\n");
|
2023-01-05 03:23:28 +03:00
|
|
|
error = ztest_dsl_prop_set_uint64(zd->zd_name,
|
2018-09-06 04:33:36 +03:00
|
|
|
ZFS_PROP_SPECIAL_SMALL_BLOCKS, 32768, B_FALSE);
|
2023-01-05 03:23:28 +03:00
|
|
|
ASSERT(error == 0 || error == ENOSPC);
|
2018-09-06 04:33:36 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
if (ztest_opts.zo_verbose >= 3) {
|
|
|
|
metaslab_class_t *mc;
|
|
|
|
|
|
|
|
if (strcmp(class, VDEV_ALLOC_BIAS_SPECIAL) == 0)
|
|
|
|
mc = spa_special_class(spa);
|
|
|
|
else
|
|
|
|
mc = spa_dedup_class(spa);
|
|
|
|
(void) printf("Added a %s mirrored vdev (of %d)\n",
|
|
|
|
class, (int)mc->mc_groups);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Verify that adding/removing aux devices (l2arc, hot spare) works as expected.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_vdev_aux_add_remove(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
spa_aux_vdev_t *sav;
|
2022-04-19 21:38:30 +03:00
|
|
|
const char *aux;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *path;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t guid = 0;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
int error, ignore_err = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
path = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (ztest_random(2) == 0) {
|
|
|
|
sav = &spa->spa_spares;
|
|
|
|
aux = ZPOOL_CONFIG_SPARES;
|
|
|
|
} else {
|
|
|
|
sav = &spa->spa_l2cache;
|
|
|
|
aux = ZPOOL_CONFIG_L2CACHE;
|
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
|
|
|
if (sav->sav_count != 0 && ztest_random(4) == 0) {
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Pick a random device to remove.
|
|
|
|
*/
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
vdev_t *svd = sav->sav_vdevs[ztest_random(sav->sav_count)];
|
|
|
|
|
|
|
|
/* dRAID spares cannot be removed; try anyways to see ENOTSUP */
|
|
|
|
if (strstr(svd->vdev_path, VDEV_TYPE_DRAID) != NULL)
|
|
|
|
ignore_err = ENOTSUP;
|
|
|
|
|
|
|
|
guid = svd->vdev_guid;
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Find an unused device we can add.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_vdev_aux = 0;
|
2008-12-03 23:09:06 +03:00
|
|
|
for (;;) {
|
|
|
|
int c;
|
2013-07-02 08:07:15 +04:00
|
|
|
(void) snprintf(path, MAXPATHLEN, ztest_aux_template,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_dir, ztest_opts.zo_pool, aux,
|
|
|
|
zs->zs_vdev_aux);
|
2008-12-03 23:09:06 +03:00
|
|
|
for (c = 0; c < sav->sav_count; c++)
|
|
|
|
if (strcmp(sav->sav_vdevs[c]->vdev_path,
|
|
|
|
path) == 0)
|
|
|
|
break;
|
|
|
|
if (c == sav->sav_count &&
|
|
|
|
vdev_lookup_by_path(rvd, path) == NULL)
|
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_vdev_aux++;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (guid == 0) {
|
|
|
|
/*
|
|
|
|
* Add a new device.
|
|
|
|
*/
|
2012-12-15 04:28:49 +04:00
|
|
|
nvlist_t *nvroot = make_vdev_root(NULL, aux, NULL,
|
2018-09-06 04:33:36 +03:00
|
|
|
(ztest_opts.zo_vdev_size * 5) / 4, 0, NULL, 0, 0, 1);
|
2024-03-29 22:15:56 +03:00
|
|
|
error = spa_vdev_add(spa, nvroot, B_FALSE);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
break;
|
|
|
|
default:
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "spa_vdev_add(%p) = %d", nvroot, error);
|
2016-12-17 01:11:29 +03:00
|
|
|
}
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Remove an existing device. Sometimes, dirty its
|
|
|
|
* vdev state first to make sure we handle removal
|
|
|
|
* of devices that have pending state changes.
|
|
|
|
*/
|
|
|
|
if (ztest_random(2) == 0)
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) vdev_online(spa, guid, 0, NULL);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
error = spa_vdev_remove(spa, guid, B_FALSE);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
switch (error) {
|
|
|
|
case 0:
|
|
|
|
case EBUSY:
|
|
|
|
case ZFS_ERR_CHECKPOINT_EXISTS:
|
|
|
|
case ZFS_ERR_DISCARDING_CHECKPOINT:
|
|
|
|
break;
|
|
|
|
default:
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (error != ignore_err)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE,
|
|
|
|
"spa_vdev_remove(%"PRIu64") = %d",
|
|
|
|
guid, error);
|
2016-12-17 01:11:29 +03:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
|
|
|
umem_free(path, MAXPATHLEN);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* split a pool if it has mirror tlvdevs
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_split_pool(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
nvlist_t *tree, **child, *config, *split, **schild;
|
|
|
|
uint_t c, children, schildren = 0, lastlogid = 0;
|
|
|
|
int error = 0;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2019-01-13 21:11:52 +03:00
|
|
|
/* ensure we have a usable config; mirrors of raidz aren't supported */
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (zs->zs_mirrors < 3 || ztest_opts.zo_raid_children > 1) {
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* clean up the old pool, if any */
|
|
|
|
(void) spa_destroy("splitp");
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
|
|
|
/* generate a config from the existing config */
|
|
|
|
mutex_enter(&spa->spa_props_lock);
|
2021-01-11 20:30:33 +03:00
|
|
|
tree = fnvlist_lookup_nvlist(spa->spa_config, ZPOOL_CONFIG_VDEV_TREE);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&spa->spa_props_lock);
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(tree, ZPOOL_CONFIG_CHILDREN,
|
|
|
|
&child, &children));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Handle possible null pointers from malloc/strdup/strndup()
GCC 12.1.1_p20220625's static analyzer caught these.
Of the two in the btree test, one had previously been caught by Coverity
and Smatch, but GCC flagged it as a false positive. Upon examining how
other test cases handle this, the solution was changed from
`ASSERT3P(node, !=, NULL);` to using `perror()` to be consistent with
the fixes to the other fixes done to the ZTS code.
That approach was also used in ZED since I did not see a better way of
handling this there. Also, upon inspection, additional unchecked
pointers from malloc()/calloc()/strdup() were found in ZED, so those
were handled too.
In other parts of the code, the existing methods to avoid issues from
memory allocators returning NULL were used, such as using
`umem_alloc(size, UMEM_NOFAIL)` or returning `ENOMEM`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13979
2022-10-07 03:18:40 +03:00
|
|
|
schild = umem_alloc(rvd->vdev_children * sizeof (nvlist_t *),
|
|
|
|
UMEM_NOFAIL);
|
2010-05-29 00:45:14 +04:00
|
|
|
for (c = 0; c < children; c++) {
|
|
|
|
vdev_t *tvd = rvd->vdev_child[c];
|
|
|
|
nvlist_t **mchild;
|
|
|
|
uint_t mchildren;
|
|
|
|
|
|
|
|
if (tvd->vdev_islog || tvd->vdev_ops == &vdev_hole_ops) {
|
2021-01-11 20:30:33 +03:00
|
|
|
schild[schildren] = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(schild[schildren],
|
|
|
|
ZPOOL_CONFIG_TYPE, VDEV_TYPE_HOLE);
|
|
|
|
fnvlist_add_uint64(schild[schildren],
|
|
|
|
ZPOOL_CONFIG_IS_HOLE, 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (lastlogid == 0)
|
|
|
|
lastlogid = schildren;
|
|
|
|
++schildren;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
lastlogid = 0;
|
2021-01-11 20:30:33 +03:00
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(child[c],
|
|
|
|
ZPOOL_CONFIG_CHILDREN, &mchild, &mchildren));
|
|
|
|
schild[schildren++] = fnvlist_dup(mchild[0]);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* OK, create a config that can be used to split */
|
2021-01-11 20:30:33 +03:00
|
|
|
split = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(split, ZPOOL_CONFIG_TYPE, VDEV_TYPE_ROOT);
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(split, ZPOOL_CONFIG_CHILDREN,
|
|
|
|
(const nvlist_t **)schild, lastlogid != 0 ? lastlogid : schildren);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
config = fnvlist_alloc();
|
|
|
|
fnvlist_add_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, split);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
for (c = 0; c < schildren; c++)
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(schild[c]);
|
Handle possible null pointers from malloc/strdup/strndup()
GCC 12.1.1_p20220625's static analyzer caught these.
Of the two in the btree test, one had previously been caught by Coverity
and Smatch, but GCC flagged it as a false positive. Upon examining how
other test cases handle this, the solution was changed from
`ASSERT3P(node, !=, NULL);` to using `perror()` to be consistent with
the fixes to the other fixes done to the ZTS code.
That approach was also used in ZED since I did not see a better way of
handling this there. Also, upon inspection, additional unchecked
pointers from malloc()/calloc()/strdup() were found in ZED, so those
were handled too.
In other parts of the code, the existing methods to avoid issues from
memory allocators returning NULL were used, such as using
`umem_alloc(size, UMEM_NOFAIL)` or returning `ENOMEM`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13979
2022-10-07 03:18:40 +03:00
|
|
|
umem_free(schild, rvd->vdev_children * sizeof (nvlist_t *));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(split);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_wrlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_vdev_split_mirror(spa, "splitp", config, NULL, B_FALSE);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(config);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (error == 0) {
|
|
|
|
(void) printf("successful split - results:\n");
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
show_pool_stats(spa);
|
|
|
|
show_pool_stats(spa_lookup("splitp"));
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
++zs->zs_splits;
|
|
|
|
--zs->zs_mirrors;
|
|
|
|
}
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can attach and detach devices.
|
|
|
|
*/
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_vdev_attach_detach(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_aux_vdev_t *sav = &spa->spa_spares;
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
vdev_t *oldvd, *newvd, *pvd;
|
2008-12-03 23:09:06 +03:00
|
|
|
nvlist_t *root;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t leaves;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t leaf, top;
|
|
|
|
uint64_t ashift = ztest_get_ashift();
|
2009-01-16 00:59:39 +03:00
|
|
|
uint64_t oldguid, pguid;
|
2013-08-08 00:16:22 +04:00
|
|
|
uint64_t oldsize, newsize;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
uint64_t raidz_children;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *oldpath, *newpath;
|
2008-11-20 23:01:55 +03:00
|
|
|
int replacing;
|
2008-12-03 23:09:06 +03:00
|
|
|
int oldvd_has_siblings = B_FALSE;
|
|
|
|
int newvd_is_spare = B_FALSE;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
int newvd_is_dspare = B_FALSE;
|
2008-12-03 23:09:06 +03:00
|
|
|
int oldvd_is_log;
|
2023-01-05 02:59:50 +03:00
|
|
|
int oldvd_is_special;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error, expected_error;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
oldpath = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
newpath = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_children = ztest_get_raidz_children(spa);
|
|
|
|
leaves = MAX(zs->zs_mirrors, 1) * raidz_children;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If a vdev is in the process of being removed, its removal may
|
|
|
|
* finish while we are in progress, leading to an unexpected error
|
|
|
|
* value. Don't bother trying to attach while we are in the middle
|
|
|
|
* of removal.
|
|
|
|
*/
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
if (ztest_device_removal_active) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2020-12-25 07:51:13 +03:00
|
|
|
goto out;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
/*
|
|
|
|
* RAIDZ leaf VDEV mirrors are not currently supported while a
|
|
|
|
* RAIDZ expansion is in progress.
|
|
|
|
*/
|
|
|
|
if (ztest_opts.zo_raid_do_expand) {
|
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Decide whether to do an attach or a replace.
|
|
|
|
*/
|
|
|
|
replacing = ztest_random(2);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick a random top-level vdev.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
top = ztest_random_vdev_top(spa, B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick a random leaf within it.
|
|
|
|
*/
|
|
|
|
leaf = ztest_random(leaves);
|
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Locate this vdev.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
oldvd = rvd->vdev_child[top];
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
/* pick a child from the mirror */
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zs->zs_mirrors >= 1) {
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(oldvd->vdev_ops, ==, &vdev_mirror_ops);
|
|
|
|
ASSERT3U(oldvd->vdev_children, >=, zs->zs_mirrors);
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
oldvd = oldvd->vdev_child[leaf / raidz_children];
|
2009-01-16 00:59:39 +03:00
|
|
|
}
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
/* pick a child out of the raidz group */
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (ztest_opts.zo_raid_children > 1) {
|
|
|
|
if (strcmp(oldvd->vdev_ops->vdev_op_type, "raidz") == 0)
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(oldvd->vdev_ops, ==, &vdev_raidz_ops);
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
else
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(oldvd->vdev_ops, ==, &vdev_draid_ops);
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
oldvd = oldvd->vdev_child[leaf % raidz_children];
|
2009-01-16 00:59:39 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* If we're already doing an attach or replace, oldvd may be a
|
|
|
|
* mirror vdev -- in which case, pick a random child.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
while (oldvd->vdev_children != 0) {
|
|
|
|
oldvd_has_siblings = B_TRUE;
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(oldvd->vdev_children, >=, 2);
|
2009-01-16 00:59:39 +03:00
|
|
|
oldvd = oldvd->vdev_child[ztest_random(oldvd->vdev_children)];
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
oldguid = oldvd->vdev_guid;
|
2009-07-03 02:44:48 +04:00
|
|
|
oldsize = vdev_get_min_asize(oldvd);
|
2008-12-03 23:09:06 +03:00
|
|
|
oldvd_is_log = oldvd->vdev_top->vdev_islog;
|
2023-01-05 02:59:50 +03:00
|
|
|
oldvd_is_special =
|
|
|
|
oldvd->vdev_top->vdev_alloc_bias == VDEV_BIAS_SPECIAL ||
|
|
|
|
oldvd->vdev_top->vdev_alloc_bias == VDEV_BIAS_DEDUP;
|
Fix unsafe string operations
Coverity caught unsafe use of `strcpy()` in `ztest_dmu_objset_own()`,
`nfs_init_tmpfile()` and `dump_snapshot()`. It also caught an unsafe use
of `strlcat()` in `nfs_init_tmpfile()`.
Inspired by this, I did an audit of every single usage of `strcpy()` and
`strcat()` in the code. If I could not prove that the usage was safe, I
changed the code to use either `strlcpy()` or `strlcat()`, depending on
which function was originally used. In some cases, `snprintf()` was used
to replace multiple uses of `strcat` because it was cleaner.
Whenever I changed a function, I preferred to use `sizeof(dst)` when the
compiler is able to provide the string size via that. When it could not
because the string was passed by a caller, I checked the entire call
tree of the function to find out how big the buffer was and hard coded
it. Hardcoding is less than ideal, but it is safe unless someone shrinks
the buffer sizes being passed.
Additionally, Coverity reported three more string related issues:
* It caught a case where we do an overlapping memory copy in a call to
`snprintf()`. We fix that via `kmem_strdup()` and `kmem_strfree()`.
* It caught `sizeof (buf)` being used instead of `buflen` in
`zdb_nicenum()`'s call to `zfs_nicenum()`, which is passed to
`snprintf()`. We change that to pass `buflen`.
* It caught a theoretical unterminated string passed to `strcmp()`.
This one is likely a false positive, but we have the information
needed to do this more safely, so we change this to silence the false
positive not just in coverity, but potentially other static analysis
tools too. We switch to `strncmp()`.
* There was a false positive in tests/zfs-tests/cmd/dir_rd_update.c. We
suppress it by switching to `snprintf()` since other static analysis
tools might complain about it too. Interestingly, there is a possible
real bug there too, since it assumes that the passed directory path
ends with '/'. We add a '/' to fix that potential bug.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13913
2022-09-28 02:47:24 +03:00
|
|
|
(void) strlcpy(oldpath, oldvd->vdev_path, MAXPATHLEN);
|
2008-12-03 23:09:06 +03:00
|
|
|
pvd = oldvd->vdev_parent;
|
2009-01-16 00:59:39 +03:00
|
|
|
pguid = pvd->vdev_guid;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2019-01-18 20:47:55 +03:00
|
|
|
* If oldvd has siblings, then half of the time, detach it. Prior
|
|
|
|
* to the detach the pool is scrubbed in order to prevent creating
|
|
|
|
* unrepairable blocks as a result of the data corruption injection.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (oldvd_has_siblings && ztest_random(2) == 0) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2019-01-18 20:47:55 +03:00
|
|
|
|
|
|
|
error = ztest_scrub_impl(spa);
|
|
|
|
if (error)
|
|
|
|
goto out;
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
error = spa_vdev_detach(spa, oldguid, pguid, B_FALSE);
|
|
|
|
if (error != 0 && error != ENODEV && error != EBUSY &&
|
2016-12-17 01:11:29 +03:00
|
|
|
error != ENOTSUP && error != ZFS_ERR_CHECKPOINT_EXISTS &&
|
|
|
|
error != ZFS_ERR_DISCARDING_CHECKPOINT)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "detach (%s) returned %d",
|
|
|
|
oldpath, error);
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* For the new vdev, choose with equal probability between the two
|
|
|
|
* standard paths (ending in either 'a' or 'b') or a random hot spare.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (sav->sav_count != 0 && ztest_random(3) == 0) {
|
|
|
|
newvd = sav->sav_vdevs[ztest_random(sav->sav_count)];
|
|
|
|
newvd_is_spare = B_TRUE;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
|
|
|
|
if (newvd->vdev_ops == &vdev_draid_spare_ops)
|
|
|
|
newvd_is_dspare = B_TRUE;
|
|
|
|
|
Fix unsafe string operations
Coverity caught unsafe use of `strcpy()` in `ztest_dmu_objset_own()`,
`nfs_init_tmpfile()` and `dump_snapshot()`. It also caught an unsafe use
of `strlcat()` in `nfs_init_tmpfile()`.
Inspired by this, I did an audit of every single usage of `strcpy()` and
`strcat()` in the code. If I could not prove that the usage was safe, I
changed the code to use either `strlcpy()` or `strlcat()`, depending on
which function was originally used. In some cases, `snprintf()` was used
to replace multiple uses of `strcat` because it was cleaner.
Whenever I changed a function, I preferred to use `sizeof(dst)` when the
compiler is able to provide the string size via that. When it could not
because the string was passed by a caller, I checked the entire call
tree of the function to find out how big the buffer was and hard coded
it. Hardcoding is less than ideal, but it is safe unless someone shrinks
the buffer sizes being passed.
Additionally, Coverity reported three more string related issues:
* It caught a case where we do an overlapping memory copy in a call to
`snprintf()`. We fix that via `kmem_strdup()` and `kmem_strfree()`.
* It caught `sizeof (buf)` being used instead of `buflen` in
`zdb_nicenum()`'s call to `zfs_nicenum()`, which is passed to
`snprintf()`. We change that to pass `buflen`.
* It caught a theoretical unterminated string passed to `strcmp()`.
This one is likely a false positive, but we have the information
needed to do this more safely, so we change this to silence the false
positive not just in coverity, but potentially other static analysis
tools too. We switch to `strncmp()`.
* There was a false positive in tests/zfs-tests/cmd/dir_rd_update.c. We
suppress it by switching to `snprintf()` since other static analysis
tools might complain about it too. Interestingly, there is a possible
real bug there too, since it assumes that the passed directory path
ends with '/'. We add a '/' to fix that potential bug.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13913
2022-09-28 02:47:24 +03:00
|
|
|
(void) strlcpy(newpath, newvd->vdev_path, MAXPATHLEN);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
Fix ztest vdev file paths.
Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:
snprintf(path, sizeof (path), ztest_dev_template, ...);
This worked fine until 40b84e7aec6392187722e61e5a4a853b530bf60f, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.
As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".
This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
2012-10-01 18:34:52 +04:00
|
|
|
(void) snprintf(newpath, MAXPATHLEN, ztest_dev_template,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_dir, ztest_opts.zo_pool,
|
|
|
|
top * leaves + leaf);
|
2008-12-03 23:09:06 +03:00
|
|
|
if (ztest_random(2) == 0)
|
|
|
|
newpath[strlen(newpath) - 1] = 'b';
|
|
|
|
newvd = vdev_lookup_by_path(rvd, newpath);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (newvd) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
|
|
|
* Reopen to ensure the vdev's asize field isn't stale.
|
|
|
|
*/
|
|
|
|
vdev_reopen(newvd);
|
2009-07-03 02:44:48 +04:00
|
|
|
newsize = vdev_get_min_asize(newvd);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Make newsize a little bigger or smaller than oldsize.
|
|
|
|
* If it's smaller, the attach should fail.
|
|
|
|
* If it's larger, and we're doing a replace,
|
|
|
|
* we should get dynamic LUN growth when we're done.
|
|
|
|
*/
|
|
|
|
newsize = 10 * oldsize / (9 + ztest_random(3));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If pvd is not a mirror or root, the attach should fail with ENOTSUP,
|
|
|
|
* unless it's a replace; in that case any non-replacing parent is OK.
|
|
|
|
*
|
|
|
|
* If newvd is already part of the pool, it should fail with EBUSY.
|
|
|
|
*
|
|
|
|
* If newvd is too small, it should fail with EOVERFLOW.
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
*
|
|
|
|
* If newvd is a distributed spare and it's being attached to a
|
|
|
|
* dRAID which is not its parent it should fail with EINVAL.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (pvd->vdev_ops != &vdev_mirror_ops &&
|
|
|
|
pvd->vdev_ops != &vdev_root_ops && (!replacing ||
|
|
|
|
pvd->vdev_ops == &vdev_replacing_ops ||
|
|
|
|
pvd->vdev_ops == &vdev_spare_ops))
|
2008-11-20 23:01:55 +03:00
|
|
|
expected_error = ENOTSUP;
|
2023-01-05 02:59:50 +03:00
|
|
|
else if (newvd_is_spare &&
|
|
|
|
(!replacing || oldvd_is_log || oldvd_is_special))
|
2008-12-03 23:09:06 +03:00
|
|
|
expected_error = ENOTSUP;
|
|
|
|
else if (newvd == oldvd)
|
|
|
|
expected_error = replacing ? 0 : EBUSY;
|
|
|
|
else if (vdev_lookup_by_path(rvd, newpath) != NULL)
|
|
|
|
expected_error = EBUSY;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
else if (!newvd_is_dspare && newsize < oldsize)
|
2008-11-20 23:01:55 +03:00
|
|
|
expected_error = EOVERFLOW;
|
|
|
|
else if (ashift > oldvd->vdev_top->vdev_ashift)
|
|
|
|
expected_error = EDOM;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
else if (newvd_is_dspare && pvd != vdev_draid_spare_get_parent(newvd))
|
2023-09-19 19:02:23 +03:00
|
|
|
expected_error = EINVAL;
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
|
|
|
expected_error = 0;
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Build the nvlist describing newpath.
|
|
|
|
*/
|
2012-12-15 04:28:49 +04:00
|
|
|
root = make_vdev_root(newpath, NULL, NULL, newvd == NULL ? newsize : 0,
|
2018-09-06 04:33:36 +03:00
|
|
|
ashift, NULL, 0, 0, 1);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-07-03 21:05:50 +03:00
|
|
|
/*
|
|
|
|
* When supported select either a healing or sequential resilver.
|
|
|
|
*/
|
|
|
|
boolean_t rebuilding = B_FALSE;
|
|
|
|
if (pvd->vdev_ops == &vdev_mirror_ops ||
|
|
|
|
pvd->vdev_ops == &vdev_root_ops) {
|
|
|
|
rebuilding = !!ztest_random(2);
|
|
|
|
}
|
|
|
|
|
|
|
|
error = spa_vdev_attach(spa, oldguid, root, replacing, rebuilding);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(root);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If our parent was the replacing vdev, but the replace completed,
|
|
|
|
* then instead of failing with ENOTSUP we may either succeed,
|
|
|
|
* fail with ENODEV, or fail with EOVERFLOW.
|
|
|
|
*/
|
|
|
|
if (expected_error == ENOTSUP &&
|
|
|
|
(error == 0 || error == ENODEV || error == EOVERFLOW))
|
|
|
|
expected_error = error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If someone grew the LUN, the replacement may be too small.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (error == EOVERFLOW || error == EBUSY)
|
2008-11-20 23:01:55 +03:00
|
|
|
expected_error = error;
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
if (error == ZFS_ERR_CHECKPOINT_EXISTS ||
|
2020-07-03 21:05:50 +03:00
|
|
|
error == ZFS_ERR_DISCARDING_CHECKPOINT ||
|
|
|
|
error == ZFS_ERR_RESILVER_IN_PROGRESS ||
|
|
|
|
error == ZFS_ERR_REBUILD_IN_PROGRESS)
|
2016-12-17 01:11:29 +03:00
|
|
|
expected_error = error;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (error != expected_error && expected_error != EBUSY) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "attach (%s %"PRIu64", %s %"PRIu64", %d) "
|
2008-12-03 23:09:06 +03:00
|
|
|
"returned %d, expected %d",
|
2013-08-08 00:16:22 +04:00
|
|
|
oldpath, oldsize, newpath,
|
|
|
|
newsize, replacing, error, expected_error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-08-26 22:13:05 +04:00
|
|
|
out:
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
|
|
|
umem_free(oldpath, MAXPATHLEN);
|
|
|
|
umem_free(newpath, MAXPATHLEN);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
static void
|
|
|
|
raidz_scratch_verify(void)
|
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
uint64_t write_size, logical_size, offset;
|
|
|
|
raidz_reflow_scratch_state_t state;
|
|
|
|
vdev_raidz_expand_t *vre;
|
|
|
|
vdev_t *raidvd;
|
|
|
|
|
|
|
|
ASSERT(raidz_expand_pause_point == RAIDZ_EXPAND_PAUSE_NONE);
|
|
|
|
|
|
|
|
if (ztest_scratch_state->zs_raidz_scratch_verify_pause == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
kernel_init(SPA_MODE_READ);
|
|
|
|
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa = spa_lookup(ztest_opts.zo_pool);
|
|
|
|
ASSERT(spa);
|
|
|
|
spa->spa_import_flags |= ZFS_IMPORT_SKIP_MMP;
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
|
|
|
|
|
|
|
ASSERT3U(RRSS_GET_OFFSET(&spa->spa_uberblock), !=, UINT64_MAX);
|
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_READER);
|
|
|
|
|
|
|
|
vre = spa->spa_raidz_expand;
|
|
|
|
if (vre == NULL)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
raidvd = vdev_lookup_top(spa, vre->vre_vdev_id);
|
|
|
|
offset = RRSS_GET_OFFSET(&spa->spa_uberblock);
|
|
|
|
state = RRSS_GET_STATE(&spa->spa_uberblock);
|
2024-05-10 18:47:21 +03:00
|
|
|
write_size = P2ALIGN_TYPED(VDEV_BOOT_SIZE, 1 << raidvd->vdev_ashift,
|
|
|
|
uint64_t);
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
logical_size = write_size * raidvd->vdev_children;
|
|
|
|
|
|
|
|
switch (state) {
|
|
|
|
/*
|
|
|
|
* Initial state of reflow process. RAIDZ expansion was
|
|
|
|
* requested by user, but scratch object was not created.
|
|
|
|
*/
|
|
|
|
case RRSS_SCRATCH_NOT_IN_USE:
|
|
|
|
ASSERT3U(offset, ==, 0);
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Scratch object was synced and stored in boot area.
|
|
|
|
*/
|
|
|
|
case RRSS_SCRATCH_VALID:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Scratch object was synced back to raidz start offset,
|
|
|
|
* raidz is ready for sector by sector reflow process.
|
|
|
|
*/
|
|
|
|
case RRSS_SCRATCH_INVALID_SYNCED:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Scratch object was synced back to raidz start offset
|
|
|
|
* on zpool importing, raidz is ready for sector by sector
|
|
|
|
* reflow process.
|
|
|
|
*/
|
|
|
|
case RRSS_SCRATCH_INVALID_SYNCED_ON_IMPORT:
|
|
|
|
ASSERT3U(offset, ==, logical_size);
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sector by sector reflow process started.
|
|
|
|
*/
|
|
|
|
case RRSS_SCRATCH_INVALID_SYNCED_REFLOW:
|
|
|
|
ASSERT3U(offset, >=, logical_size);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
ztest_scratch_state->zs_raidz_scratch_verify_pause = 0;
|
|
|
|
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
kernel_fini();
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_scratch_thread(void *arg)
|
|
|
|
{
|
|
|
|
(void) arg;
|
|
|
|
|
|
|
|
/* wait up to 10 seconds */
|
|
|
|
for (int t = 100; t > 0; t -= 1) {
|
|
|
|
if (raidz_expand_pause_point == RAIDZ_EXPAND_PAUSE_NONE)
|
|
|
|
thread_exit();
|
|
|
|
|
|
|
|
(void) poll(NULL, 0, 100);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* killed when the scratch area progress reached a certain point */
|
|
|
|
ztest_kill(ztest_shared);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can attach raidz device.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_vdev_raidz_attach(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
(void) zd, (void) id;
|
|
|
|
ztest_shared_t *zs = ztest_shared;
|
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
uint64_t leaves, raidz_children, newsize, ashift = ztest_get_ashift();
|
|
|
|
kthread_t *scratch_thread = NULL;
|
|
|
|
vdev_t *newvd, *pvd;
|
|
|
|
nvlist_t *root;
|
|
|
|
char *newpath = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
int error, expected_error = 0;
|
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_READER);
|
|
|
|
|
|
|
|
/* Only allow attach when raid-kind = 'eraidz' */
|
|
|
|
if (!ztest_opts.zo_raid_do_expand) {
|
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ztest_opts.zo_mmp_test) {
|
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ztest_device_removal_active) {
|
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
pvd = vdev_lookup_top(spa, 0);
|
|
|
|
|
|
|
|
ASSERT(pvd->vdev_ops == &vdev_raidz_ops);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get size of a child of the raidz group,
|
|
|
|
* make sure device is a bit bigger
|
|
|
|
*/
|
|
|
|
newvd = pvd->vdev_child[ztest_random(pvd->vdev_children)];
|
|
|
|
newsize = 10 * vdev_get_min_asize(newvd) / (9 + ztest_random(2));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get next attached leaf id
|
|
|
|
*/
|
|
|
|
raidz_children = ztest_get_raidz_children(spa);
|
|
|
|
leaves = MAX(zs->zs_mirrors + zs->zs_splits, 1) * raidz_children;
|
|
|
|
zs->zs_vdev_next_leaf = spa_num_top_vdevs(spa) * leaves;
|
|
|
|
|
|
|
|
if (spa->spa_raidz_expand)
|
|
|
|
expected_error = ZFS_ERR_RAIDZ_EXPAND_IN_PROGRESS;
|
|
|
|
|
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Path to vdev to be attached
|
|
|
|
*/
|
|
|
|
(void) snprintf(newpath, MAXPATHLEN, ztest_dev_template,
|
|
|
|
ztest_opts.zo_dir, ztest_opts.zo_pool, zs->zs_vdev_next_leaf);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Build the nvlist describing newpath.
|
|
|
|
*/
|
|
|
|
root = make_vdev_root(newpath, NULL, NULL, newsize, ashift, NULL,
|
|
|
|
0, 0, 1);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 50% of the time, set raidz_expand_pause_point to cause
|
|
|
|
* raidz_reflow_scratch_sync() to pause at a certain point and
|
|
|
|
* then kill the test after 10 seconds so raidz_scratch_verify()
|
|
|
|
* can confirm consistency when the pool is imported.
|
|
|
|
*/
|
|
|
|
if (ztest_random(2) == 0 && expected_error == 0) {
|
|
|
|
raidz_expand_pause_point =
|
|
|
|
ztest_random(RAIDZ_EXPAND_PAUSE_SCRATCH_POST_REFLOW_2) + 1;
|
|
|
|
scratch_thread = thread_create(NULL, 0, ztest_scratch_thread,
|
|
|
|
ztest_shared, 0, NULL, TS_RUN | TS_JOINABLE, defclsyspri);
|
|
|
|
}
|
|
|
|
|
|
|
|
error = spa_vdev_attach(spa, pvd->vdev_guid, root, B_FALSE, B_FALSE);
|
|
|
|
|
|
|
|
nvlist_free(root);
|
|
|
|
|
|
|
|
if (error == EOVERFLOW || error == ENXIO ||
|
|
|
|
error == ZFS_ERR_CHECKPOINT_EXISTS ||
|
|
|
|
error == ZFS_ERR_DISCARDING_CHECKPOINT)
|
|
|
|
expected_error = error;
|
|
|
|
|
|
|
|
if (error != 0 && error != expected_error) {
|
|
|
|
fatal(0, "raidz attach (%s %"PRIu64") returned %d, expected %d",
|
|
|
|
newpath, newsize, error, expected_error);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (raidz_expand_pause_point) {
|
|
|
|
if (error != 0) {
|
|
|
|
/*
|
|
|
|
* Do not verify scratch object in case of error
|
|
|
|
* returned by vdev attaching.
|
|
|
|
*/
|
|
|
|
raidz_expand_pause_point = RAIDZ_EXPAND_PAUSE_NONE;
|
|
|
|
}
|
|
|
|
|
|
|
|
VERIFY0(thread_join(scratch_thread));
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
umem_free(newpath, MAXPATHLEN);
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
void
|
|
|
|
ztest_device_removal(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
vdev_t *vd;
|
|
|
|
uint64_t guid;
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
int error;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
if (ztest_device_removal_active) {
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove a random top-level vdev and wait for removal to finish.
|
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
vd = vdev_lookup_top(spa, ztest_random_vdev_top(spa, B_FALSE));
|
|
|
|
guid = vd->vdev_guid;
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
error = spa_vdev_remove(spa, guid, B_FALSE);
|
|
|
|
if (error == 0) {
|
|
|
|
ztest_device_removal_active = B_TRUE;
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
|
2018-10-22 16:54:01 +03:00
|
|
|
/*
|
|
|
|
* spa->spa_vdev_removal is created in a sync task that
|
|
|
|
* is initiated via dsl_sync_task_nowait(). Since the
|
|
|
|
* task may not run before spa_vdev_remove() returns, we
|
|
|
|
* must wait at least 1 txg to ensure that the removal
|
|
|
|
* struct has been created.
|
|
|
|
*/
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
2018-11-29 07:47:09 +03:00
|
|
|
while (spa->spa_removing_phys.sr_state == DSS_SCANNING)
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
} else {
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
/*
|
|
|
|
* The pool needs to be scrubbed after completing device removal.
|
|
|
|
* Failure to do so may result in checksum errors due to the
|
|
|
|
* strategy employed by ztest_fault_inject() when selecting which
|
|
|
|
* offset are redundant and can be damaged.
|
|
|
|
*/
|
|
|
|
error = spa_scan(spa, POOL_SCAN_SCRUB);
|
|
|
|
if (error == 0) {
|
|
|
|
while (dsl_scan_scrubbing(spa_get_dsl(spa)))
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
ztest_device_removal_active = B_FALSE;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Callback function which expands the physical size of the vdev.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static vdev_t *
|
2009-07-03 02:44:48 +04:00
|
|
|
grow_vdev(vdev_t *vd, void *arg)
|
|
|
|
{
|
2019-12-05 23:37:00 +03:00
|
|
|
spa_t *spa __maybe_unused = vd->vdev_spa;
|
2009-07-03 02:44:48 +04:00
|
|
|
size_t *newsize = arg;
|
|
|
|
size_t fsize;
|
|
|
|
int fd;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(spa_config_held(spa, SCL_STATE, RW_READER), ==, SCL_STATE);
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT(vd->vdev_ops->vdev_op_leaf);
|
|
|
|
|
|
|
|
if ((fd = open(vd->vdev_path, O_RDWR)) == -1)
|
|
|
|
return (vd);
|
|
|
|
|
|
|
|
fsize = lseek(fd, 0, SEEK_END);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(ftruncate(fd, *newsize));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6) {
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) printf("%s grew from %lu to %lu bytes\n",
|
|
|
|
vd->vdev_path, (ulong_t)fsize, (ulong_t)*newsize);
|
|
|
|
}
|
|
|
|
(void) close(fd);
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Callback function which expands a given vdev by calling vdev_online().
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static vdev_t *
|
2009-07-03 02:44:48 +04:00
|
|
|
online_vdev(vdev_t *vd, void *arg)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) arg;
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_t *spa = vd->vdev_spa;
|
|
|
|
vdev_t *tvd = vd->vdev_top;
|
|
|
|
uint64_t guid = vd->vdev_guid;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t generation = spa->spa_config_generation + 1;
|
|
|
|
vdev_state_t newstate = VDEV_STATE_UNKNOWN;
|
|
|
|
int error;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(spa_config_held(spa, SCL_STATE, RW_READER), ==, SCL_STATE);
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT(vd->vdev_ops->vdev_op_leaf);
|
|
|
|
|
|
|
|
/* Calling vdev_online will initialize the new metaslabs */
|
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = vdev_online(spa, guid, ZFS_ONLINE_EXPAND, &newstate);
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_config_enter(spa, SCL_STATE, spa, RW_READER);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* If vdev_online returned an error or the underlying vdev_open
|
|
|
|
* failed then we abort the expand. The only way to know that
|
|
|
|
* vdev_open fails is by checking the returned newstate.
|
|
|
|
*/
|
|
|
|
if (error || newstate != VDEV_STATE_HEALTHY) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5) {
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) printf("Unable to expand vdev, state %u, "
|
|
|
|
"error %d\n", newstate, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
return (vd);
|
|
|
|
}
|
|
|
|
ASSERT3U(newstate, ==, VDEV_STATE_HEALTHY);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Since we dropped the lock we need to ensure that we're
|
|
|
|
* still talking to the original vdev. It's possible this
|
|
|
|
* vdev may have been detached/replaced while we were
|
|
|
|
* trying to online it.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (generation != spa->spa_config_generation) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5) {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("vdev configuration has changed, "
|
2021-06-05 15:18:37 +03:00
|
|
|
"guid %"PRIu64", state %"PRIu64", "
|
|
|
|
"expected gen %"PRIu64", got gen %"PRIu64"\n",
|
|
|
|
guid,
|
|
|
|
tvd->vdev_state,
|
|
|
|
generation,
|
|
|
|
spa->spa_config_generation);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
return (vd);
|
|
|
|
}
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Traverse the vdev tree calling the supplied function.
|
|
|
|
* We continue to walk the tree until we either have walked all
|
|
|
|
* children or we receive a non-NULL return from the callback.
|
|
|
|
* If a NULL callback is passed, then we just return back the first
|
|
|
|
* leaf vdev we encounter.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static vdev_t *
|
2009-07-03 02:44:48 +04:00
|
|
|
vdev_walk_tree(vdev_t *vd, vdev_t *(*func)(vdev_t *, void *), void *arg)
|
|
|
|
{
|
2010-08-26 20:52:39 +04:00
|
|
|
uint_t c;
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (vd->vdev_ops->vdev_op_leaf) {
|
|
|
|
if (func == NULL)
|
|
|
|
return (vd);
|
|
|
|
else
|
|
|
|
return (func(vd, arg));
|
|
|
|
}
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (c = 0; c < vd->vdev_children; c++) {
|
2009-07-03 02:44:48 +04:00
|
|
|
vdev_t *cvd = vd->vdev_child[c];
|
|
|
|
if ((cvd = vdev_walk_tree(cvd, func, arg)) != NULL)
|
|
|
|
return (cvd);
|
|
|
|
}
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Verify that dynamic LUN growth works as expected.
|
|
|
|
*/
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_vdev_LUN_growth(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *vd, *tvd;
|
|
|
|
metaslab_class_t *mc;
|
|
|
|
metaslab_group_t *mg;
|
2009-07-03 02:44:48 +04:00
|
|
|
size_t psize, newsize;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t top;
|
|
|
|
uint64_t old_class_space, new_class_space, old_ms_count, new_ms_count;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_enter(&ztest_checkpoint_lock);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_config_enter(spa, SCL_STATE, spa, RW_READER);
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
|
|
|
* If there is a vdev removal in progress, it could complete while
|
|
|
|
* we are running, in which case we would not be able to verify
|
|
|
|
* that the metaslab_class space increased (because it decreases
|
|
|
|
* when the device removal completes).
|
|
|
|
*/
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
if (ztest_device_removal_active) {
|
2016-12-17 01:11:29 +03:00
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
/*
|
|
|
|
* If we are under raidz expansion, the test can failed because the
|
|
|
|
* metaslabs count will not increase immediately after the vdev is
|
|
|
|
* expanded. It will happen only after raidz expansion completion.
|
|
|
|
*/
|
|
|
|
if (spa->spa_raidz_expand) {
|
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
top = ztest_random_vdev_top(spa, B_TRUE);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
tvd = spa->spa_root_vdev->vdev_child[top];
|
|
|
|
mg = tvd->vdev_mg;
|
|
|
|
mc = mg->mg_class;
|
|
|
|
old_ms_count = tvd->vdev_ms_count;
|
|
|
|
old_class_space = metaslab_class_get_space(mc);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2009-07-03 02:44:48 +04:00
|
|
|
* Determine the size of the first leaf vdev associated with
|
|
|
|
* our top-level device.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-07-03 02:44:48 +04:00
|
|
|
vd = vdev_walk_tree(tvd, NULL, NULL);
|
|
|
|
ASSERT3P(vd, !=, NULL);
|
|
|
|
ASSERT(vd->vdev_ops->vdev_op_leaf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
psize = vd->vdev_psize;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* We only try to expand the vdev if it's healthy, less than 4x its
|
|
|
|
* original size, and it has a valid psize.
|
2009-07-03 02:44:48 +04:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (tvd->vdev_state != VDEV_STATE_HEALTHY ||
|
2012-01-24 06:43:32 +04:00
|
|
|
psize == 0 || psize >= 4 * ztest_opts.zo_vdev_size) {
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
2009-07-03 02:44:48 +04:00
|
|
|
return;
|
|
|
|
}
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(psize, >, 0);
|
2018-09-06 04:33:36 +03:00
|
|
|
newsize = psize + MAX(psize / 8, SPA_MAXBLOCKSIZE);
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT3U(newsize, >, psize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6) {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("Expanding LUN %s from %lu to %lu\n",
|
2009-07-03 02:44:48 +04:00
|
|
|
vd->vdev_path, (ulong_t)psize, (ulong_t)newsize);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Growing the vdev is a two step process:
|
|
|
|
* 1). expand the physical size (i.e. relabel)
|
|
|
|
* 2). online the vdev to create the new metaslabs
|
|
|
|
*/
|
|
|
|
if (vdev_walk_tree(tvd, grow_vdev, &newsize) != NULL ||
|
|
|
|
vdev_walk_tree(tvd, online_vdev, NULL) != NULL ||
|
|
|
|
tvd->vdev_state != VDEV_STATE_HEALTHY) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5) {
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) printf("Could not expand LUN because "
|
2010-05-29 00:45:14 +04:00
|
|
|
"the vdev configuration changed.\n");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
2009-07-03 02:44:48 +04:00
|
|
|
return;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Expanding the LUN will update the config asynchronously,
|
|
|
|
* thus we must wait for the async thread to complete any
|
|
|
|
* pending tasks before proceeding.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
for (;;) {
|
|
|
|
boolean_t done;
|
|
|
|
mutex_enter(&spa->spa_async_lock);
|
|
|
|
done = (spa->spa_async_thread == NULL && !spa->spa_async_tasks);
|
|
|
|
mutex_exit(&spa->spa_async_lock);
|
|
|
|
if (done)
|
|
|
|
break;
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
(void) poll(NULL, 0, 100);
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_STATE, spa, RW_READER);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tvd = spa->spa_root_vdev->vdev_child[top];
|
|
|
|
new_ms_count = tvd->vdev_ms_count;
|
|
|
|
new_class_space = metaslab_class_get_space(mc);
|
|
|
|
|
|
|
|
if (tvd->vdev_mg != mg || mg->mg_class != mc) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5) {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("Could not verify LUN expansion due to "
|
|
|
|
"intervening vdev offline or remove.\n");
|
|
|
|
}
|
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure we were able to grow the vdev.
|
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (new_ms_count <= old_ms_count) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE,
|
|
|
|
"LUN expansion failed: ms_count %"PRIu64" < %"PRIu64"\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
old_ms_count, new_ms_count);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure we were able to grow the pool.
|
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (new_class_space <= old_class_space) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE,
|
|
|
|
"LUN expansion failed: class_space %"PRIu64" < %"PRIu64"\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
old_class_space, new_class_space);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5) {
|
2017-06-13 12:16:45 +03:00
|
|
|
char oldnumbuf[NN_NUMBUF_SZ], newnumbuf[NN_NUMBUF_SZ];
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
nicenum(old_class_space, oldnumbuf, sizeof (oldnumbuf));
|
|
|
|
nicenum(new_class_space, newnumbuf, sizeof (newnumbuf));
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) printf("%s grew from %s to %s\n",
|
|
|
|
spa->spa_name, oldnumbuf, newnumbuf);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_config_exit(spa, SCL_STATE, spa);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ztest_checkpoint_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Verify that dmu_objset_{create,destroy,open,close} work as expected.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_objset_create_cb(objset_t *os, void *arg, cred_t *cr, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) arg, (void) cr;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Create the objects common to all ztest datasets.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_create_claim(os, ZTEST_DIROBJ,
|
|
|
|
DMU_OT_ZAP_OTHER, DMU_OT_NONE, 0, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
|
|
|
ztest_dataset_create(char *dsname)
|
|
|
|
{
|
2017-09-12 23:15:11 +03:00
|
|
|
int err;
|
|
|
|
uint64_t rand;
|
|
|
|
dsl_crypto_params_t *dcp = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 50% of the time, we create encrypted datasets
|
|
|
|
* using a random cipher suite and a hard-coded
|
|
|
|
* wrapping key.
|
|
|
|
*/
|
|
|
|
rand = ztest_random(2);
|
|
|
|
if (rand != 0) {
|
|
|
|
nvlist_t *crypto_args = fnvlist_alloc();
|
|
|
|
nvlist_t *props = fnvlist_alloc();
|
|
|
|
|
|
|
|
/* slight bias towards the default cipher suite */
|
|
|
|
rand = ztest_random(ZIO_CRYPT_FUNCTIONS);
|
|
|
|
if (rand < ZIO_CRYPT_AES_128_CCM)
|
|
|
|
rand = ZIO_CRYPT_ON;
|
|
|
|
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_ENCRYPTION), rand);
|
|
|
|
fnvlist_add_uint8_array(crypto_args, "wkeydata",
|
|
|
|
(uint8_t *)ztest_wkeydata, WRAPPING_KEY_LEN);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* These parameters aren't really used by the kernel. They
|
|
|
|
* are simply stored so that userspace knows how to load
|
|
|
|
* the wrapping key.
|
|
|
|
*/
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_KEYFORMAT), ZFS_KEYFORMAT_RAW);
|
|
|
|
fnvlist_add_string(props,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_KEYLOCATION), "prompt");
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT), 0ULL);
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS), 0ULL);
|
|
|
|
|
|
|
|
VERIFY0(dsl_crypto_params_create_nvlist(DCP_CMD_NONE, props,
|
|
|
|
crypto_args, &dcp));
|
|
|
|
|
2018-08-02 21:59:24 +03:00
|
|
|
/*
|
|
|
|
* Cycle through all available encryption implementations
|
|
|
|
* to verify interoperability.
|
|
|
|
*/
|
|
|
|
VERIFY0(gcm_impl_set("cycle"));
|
|
|
|
VERIFY0(aes_impl_set("cycle"));
|
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
fnvlist_free(crypto_args);
|
|
|
|
fnvlist_free(props);
|
|
|
|
}
|
|
|
|
|
|
|
|
err = dmu_objset_create(dsname, DMU_OST_OTHER, 0, dcp,
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_objset_create_cb, NULL);
|
2017-09-12 23:15:11 +03:00
|
|
|
dsl_crypto_params_free(dcp, !!err);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
rand = ztest_random(100);
|
|
|
|
if (err || rand < 80)
|
2010-05-29 00:45:14 +04:00
|
|
|
return (err);
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5)
|
2012-09-10 18:23:21 +04:00
|
|
|
(void) printf("Setting dataset %s to sync always\n", dsname);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (ztest_dsl_prop_set_uint64(dsname, ZFS_PROP_SYNC,
|
|
|
|
ZFS_SYNC_ALWAYS, B_FALSE));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_objset_destroy_cb(const char *name, void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
objset_t *os;
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_object_info_t doi;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that the dataset contains a directory object.
|
|
|
|
*/
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY0(ztest_dmu_objset_own(name, DMU_OST_OTHER, B_TRUE,
|
|
|
|
B_TRUE, FTAG, &os));
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_object_info(os, ZTEST_DIROBJ, &doi);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error != ENOENT) {
|
|
|
|
/* We could have crashed in the middle of destroying it */
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT3U(doi.doi_type, ==, DMU_OT_ZAP_OTHER);
|
|
|
|
ASSERT3S(doi.doi_physical_blocks_512, >=, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Destroy the dataset.
|
|
|
|
*/
|
2013-09-04 16:00:57 +04:00
|
|
|
if (strchr(name, '@') != NULL) {
|
2022-02-26 22:20:32 +03:00
|
|
|
error = dsl_destroy_snapshot(name, B_TRUE);
|
|
|
|
if (error != ECHRNG) {
|
|
|
|
/*
|
|
|
|
* The program was executed, but encountered a runtime
|
|
|
|
* error, such as insufficient slop, or a hold on the
|
|
|
|
* dataset.
|
|
|
|
*/
|
|
|
|
ASSERT0(error);
|
|
|
|
}
|
2013-09-04 16:00:57 +04:00
|
|
|
} else {
|
2017-01-26 23:32:36 +03:00
|
|
|
error = dsl_destroy_head(name);
|
2018-12-14 21:03:05 +03:00
|
|
|
if (error == ENOSPC) {
|
|
|
|
/* There could be checkpoint or insufficient slop */
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
} else if (error != EBUSY) {
|
|
|
|
/* There could be a hold on this dataset */
|
2017-01-26 23:32:36 +03:00
|
|
|
ASSERT0(error);
|
2018-12-14 21:03:05 +03:00
|
|
|
}
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static boolean_t
|
|
|
|
ztest_snapshot_create(char *osname, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-16 00:28:36 +03:00
|
|
|
char snapname[ZFS_MAX_DATASET_NAME_LEN];
|
2010-05-29 00:45:14 +04:00
|
|
|
int error;
|
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) snprintf(snapname, sizeof (snapname), "%"PRIu64"", id);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dmu_objset_snapshot_one(osname, snapname);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
2023-01-05 03:05:36 +03:00
|
|
|
if (error != 0 && error != EEXIST && error != ECHRNG) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "ztest_snapshot_create(%s@%s) = %d", osname,
|
2013-09-04 16:00:57 +04:00
|
|
|
snapname, error);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static boolean_t
|
|
|
|
ztest_snapshot_destroy(char *osname, uint64_t id)
|
|
|
|
{
|
2016-06-16 00:28:36 +03:00
|
|
|
char snapname[ZFS_MAX_DATASET_NAME_LEN];
|
2010-05-29 00:45:14 +04:00
|
|
|
int error;
|
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) snprintf(snapname, sizeof (snapname), "%s@%"PRIu64"",
|
|
|
|
osname, id);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(snapname, B_FALSE);
|
2023-01-05 03:05:36 +03:00
|
|
|
if (error != 0 && error != ENOENT && error != ECHRNG)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "ztest_snapshot_destroy(%s) = %d",
|
|
|
|
snapname, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_objset_create_destroy(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_ds_t *zdtmp;
|
2010-05-29 00:45:14 +04:00
|
|
|
int iters;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
2008-12-03 23:09:06 +03:00
|
|
|
objset_t *os, *os2;
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
2008-11-20 23:01:55 +03:00
|
|
|
zilog_t *zilog;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
zdtmp = umem_alloc(sizeof (ztest_ds_t), UMEM_NOFAIL);
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) snprintf(name, sizeof (name), "%s/temp_%"PRIu64"",
|
|
|
|
ztest_opts.zo_pool, id);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this dataset exists from a previous run, process its replay log
|
2013-09-04 16:00:57 +04:00
|
|
|
* half of the time. If we don't replay it, then dsl_destroy_head()
|
2010-05-29 00:45:14 +04:00
|
|
|
* (invoked from ztest_objset_destroy_cb()) should just throw it away.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
if (ztest_random(2) == 0 &&
|
2017-09-12 23:15:11 +03:00
|
|
|
ztest_dmu_objset_own(name, DMU_OST_OTHER, B_FALSE,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
B_TRUE, FTAG, &os) == 0) {
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_zd_init(zdtmp, NULL, os);
|
2010-08-26 22:13:05 +04:00
|
|
|
zil_replay(os, zdtmp, ztest_replay_vector);
|
|
|
|
ztest_zd_fini(zdtmp);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There may be an old instance of the dataset we're about to
|
|
|
|
* create lying around from a previous run. If so, destroy it
|
|
|
|
* and all of its snapshots.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) dmu_objset_find(name, ztest_objset_destroy_cb, NULL,
|
2008-11-20 23:01:55 +03:00
|
|
|
DS_FIND_CHILDREN | DS_FIND_SNAPSHOTS);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that the destroyed dataset is no longer in the namespace.
|
2023-01-05 03:05:36 +03:00
|
|
|
* It may still be present if the destroy above fails with ENOSPC.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2023-01-05 03:05:36 +03:00
|
|
|
error = ztest_dmu_objset_own(name, DMU_OST_OTHER, B_TRUE, B_TRUE,
|
|
|
|
FTAG, &os);
|
|
|
|
if (error == 0) {
|
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
VERIFY3U(ENOENT, ==, error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can create a new dataset.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
error = ztest_dataset_create(name);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error) {
|
|
|
|
if (error == ENOSPC) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc(FTAG);
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_objset_create(%s) = %d", name, error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY0(ztest_dmu_objset_own(name, DMU_OST_OTHER, B_FALSE, B_TRUE,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
FTAG, &os));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_zd_init(zdtmp, NULL, os);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Open the intent log for it.
|
|
|
|
*/
|
2022-07-21 03:14:06 +03:00
|
|
|
zilog = zil_open(os, ztest_get_data, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Put some objects in there, do a little I/O to them,
|
|
|
|
* and randomly take a couple of snapshots along the way.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
iters = ztest_random(5);
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < iters; i++) {
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_dmu_object_alloc_free(zdtmp, id);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (ztest_random(iters) == 0)
|
|
|
|
(void) ztest_snapshot_create(name, i);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we cannot create an existing dataset.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(EEXIST, ==,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_create(name, DMU_OST_OTHER, 0, NULL, NULL, NULL));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Verify that we can hold an objset that is also owned.
|
2008-12-03 23:09:06 +03:00
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_objset_hold(name, FTAG, &os2));
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_rele(os2, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Verify that we cannot own an objset that is already owned.
|
|
|
|
*/
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY3U(EBUSY, ==, ztest_dmu_objset_own(name, DMU_OST_OTHER,
|
|
|
|
B_FALSE, B_TRUE, FTAG, &os2));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
zil_close(zilog);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_zd_fini(zdtmp);
|
|
|
|
out:
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
|
|
|
umem_free(zdtmp, sizeof (ztest_ds_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that dmu_snapshot_{create,destroy,open,close} work as expected.
|
|
|
|
*/
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_snapshot_create_destroy(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) ztest_snapshot_destroy(zd->zd_name, id);
|
|
|
|
(void) ztest_snapshot_create(zd->zd_name, id);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Cleanup non-standard snapshots and clones.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dsl_dataset_cleanup(char *osname, uint64_t id)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
char *snap1name;
|
|
|
|
char *clone1name;
|
|
|
|
char *snap2name;
|
|
|
|
char *clone2name;
|
|
|
|
char *snap3name;
|
2009-07-03 02:44:48 +04:00
|
|
|
int error;
|
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
snap1name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
clone1name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
snap2name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
clone2name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
snap3name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) snprintf(snap1name, ZFS_MAX_DATASET_NAME_LEN, "%s@s1_%"PRIu64"",
|
|
|
|
osname, id);
|
|
|
|
(void) snprintf(clone1name, ZFS_MAX_DATASET_NAME_LEN, "%s/c1_%"PRIu64"",
|
|
|
|
osname, id);
|
|
|
|
(void) snprintf(snap2name, ZFS_MAX_DATASET_NAME_LEN, "%s@s2_%"PRIu64"",
|
|
|
|
clone1name, id);
|
|
|
|
(void) snprintf(clone2name, ZFS_MAX_DATASET_NAME_LEN, "%s/c2_%"PRIu64"",
|
|
|
|
osname, id);
|
|
|
|
(void) snprintf(snap3name, ZFS_MAX_DATASET_NAME_LEN, "%s@s3_%"PRIu64"",
|
|
|
|
clone1name, id);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_head(clone2name);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != ENOENT)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_destroy_head(%s) = %d", clone2name, error);
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(snap3name, B_FALSE);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != ENOENT)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_destroy_snapshot(%s) = %d",
|
|
|
|
snap3name, error);
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(snap2name, B_FALSE);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != ENOENT)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_destroy_snapshot(%s) = %d",
|
|
|
|
snap2name, error);
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_head(clone1name);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != ENOENT)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_destroy_head(%s) = %d", clone1name, error);
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(snap1name, B_FALSE);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != ENOENT)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_destroy_snapshot(%s) = %d",
|
|
|
|
snap1name, error);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
umem_free(snap1name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(clone1name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(snap2name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(clone2name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(snap3name, ZFS_MAX_DATASET_NAME_LEN);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify dsl_dataset_promote handles EBUSY
|
|
|
|
*/
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dsl_dataset_promote_busy(ztest_ds_t *zd, uint64_t id)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
objset_t *os;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *snap1name;
|
|
|
|
char *clone1name;
|
|
|
|
char *snap2name;
|
|
|
|
char *clone2name;
|
|
|
|
char *snap3name;
|
2010-05-29 00:45:14 +04:00
|
|
|
char *osname = zd->zd_name;
|
|
|
|
int error;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
snap1name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
clone1name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
snap2name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
clone2name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
|
|
|
snap3name = umem_alloc(ZFS_MAX_DATASET_NAME_LEN, UMEM_NOFAIL);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dsl_dataset_cleanup(osname, id);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) snprintf(snap1name, ZFS_MAX_DATASET_NAME_LEN, "%s@s1_%"PRIu64"",
|
|
|
|
osname, id);
|
|
|
|
(void) snprintf(clone1name, ZFS_MAX_DATASET_NAME_LEN, "%s/c1_%"PRIu64"",
|
|
|
|
osname, id);
|
|
|
|
(void) snprintf(snap2name, ZFS_MAX_DATASET_NAME_LEN, "%s@s2_%"PRIu64"",
|
|
|
|
clone1name, id);
|
|
|
|
(void) snprintf(clone2name, ZFS_MAX_DATASET_NAME_LEN, "%s/c2_%"PRIu64"",
|
|
|
|
osname, id);
|
|
|
|
(void) snprintf(snap3name, ZFS_MAX_DATASET_NAME_LEN, "%s@s3_%"PRIu64"",
|
|
|
|
clone1name, id);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
error = dmu_objset_snapshot_one(osname, strchr(snap1name, '@') + 1);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error && error != EEXIST) {
|
|
|
|
if (error == ENOSPC) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc(FTAG);
|
2009-07-03 02:44:48 +04:00
|
|
|
goto out;
|
|
|
|
}
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_take_snapshot(%s) = %d", snap1name, error);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dmu_objset_clone(clone1name, snap1name);
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error) {
|
|
|
|
if (error == ENOSPC) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc(FTAG);
|
2009-07-03 02:44:48 +04:00
|
|
|
goto out;
|
|
|
|
}
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_objset_create(%s) = %d", clone1name, error);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
error = dmu_objset_snapshot_one(clone1name, strchr(snap2name, '@') + 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error && error != EEXIST) {
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_open_snapshot(%s) = %d", snap2name, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
error = dmu_objset_snapshot_one(clone1name, strchr(snap3name, '@') + 1);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error && error != EEXIST) {
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_open_snapshot(%s) = %d", snap3name, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dmu_objset_clone(clone2name, snap3name);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_objset_create(%s) = %d", clone2name, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
error = ztest_dmu_objset_own(snap2name, DMU_OST_ANY, B_TRUE, B_TRUE,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
FTAG, &os);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_objset_own(%s) = %d", snap2name, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dsl_dataset_promote(clone2name, NULL);
|
2014-06-06 01:19:08 +04:00
|
|
|
if (error == ENOSPC) {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2014-06-06 01:19:08 +04:00
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
goto out;
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != EBUSY)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_dataset_promote(%s), %d, not EBUSY",
|
|
|
|
clone2name, error);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
out:
|
|
|
|
ztest_dsl_dataset_cleanup(osname, id);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
umem_free(snap1name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(clone1name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(snap2name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(clone2name, ZFS_MAX_DATASET_NAME_LEN);
|
|
|
|
umem_free(snap3name, ZFS_MAX_DATASET_NAME_LEN);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
#undef OD_ARRAY_SIZE
|
2013-11-01 23:26:11 +04:00
|
|
|
#define OD_ARRAY_SIZE 4
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Verify that dmu_object_{alloc,free} work as expected.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_dmu_object_alloc_free(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
|
|
|
int batchsize;
|
|
|
|
int size;
|
2010-08-26 20:52:39 +04:00
|
|
|
int b;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
size = sizeof (ztest_od_t) * OD_ARRAY_SIZE;
|
2010-08-26 22:13:05 +04:00
|
|
|
od = umem_alloc(size, UMEM_NOFAIL);
|
|
|
|
batchsize = OD_ARRAY_SIZE;
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (b = 0; b < batchsize; b++)
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od + b, id, FTAG, b, DMU_OT_UINT64_OTHER,
|
|
|
|
0, 0, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Destroy the previous batch of objects, create a new batch,
|
|
|
|
* and do some I/O on the new objects.
|
|
|
|
*/
|
2023-03-07 02:30:29 +03:00
|
|
|
if (ztest_object_init(zd, od, size, B_TRUE) != 0) {
|
|
|
|
zd->zd_od = NULL;
|
|
|
|
umem_free(od, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2023-03-07 02:30:29 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
while (ztest_random(4 * batchsize) != 0)
|
|
|
|
ztest_io(zd, od[ztest_random(batchsize)].od_object,
|
|
|
|
ztest_random(ZTEST_RANGE_LOCKS) << SPA_MAXBLOCKSHIFT);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
|
|
|
umem_free(od, size);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2018-01-19 12:19:47 +03:00
|
|
|
/*
|
|
|
|
* Rewind the global allocator to verify object allocation backfilling.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_dmu_object_next_chunk(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) id;
|
2018-01-19 12:19:47 +03:00
|
|
|
objset_t *os = zd->zd_os;
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
uint_t dnodes_per_chunk = 1 << dmu_object_alloc_chunk_shift;
|
2018-01-19 12:19:47 +03:00
|
|
|
uint64_t object;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Rewind the global allocator randomly back to a lower object number
|
|
|
|
* to force backfilling and reclamation of recently freed dnodes.
|
|
|
|
*/
|
|
|
|
mutex_enter(&os->os_obj_lock);
|
|
|
|
object = ztest_random(os->os_obj_next_chunk);
|
2024-05-10 18:47:21 +03:00
|
|
|
os->os_obj_next_chunk = P2ALIGN_TYPED(object, dnodes_per_chunk,
|
|
|
|
uint64_t);
|
2018-01-19 12:19:47 +03:00
|
|
|
mutex_exit(&os->os_obj_lock);
|
|
|
|
}
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
#undef OD_ARRAY_SIZE
|
2013-11-01 23:26:11 +04:00
|
|
|
#define OD_ARRAY_SIZE 2
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Verify that dmu_{read,write} work as expected.
|
|
|
|
*/
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_read_write(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
int size;
|
|
|
|
ztest_od_t *od;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
2013-11-01 23:26:11 +04:00
|
|
|
size = sizeof (ztest_od_t) * OD_ARRAY_SIZE;
|
2010-08-26 22:13:05 +04:00
|
|
|
od = umem_alloc(size, UMEM_NOFAIL);
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_tx_t *tx;
|
2021-06-05 15:18:37 +03:00
|
|
|
int freeit, error;
|
|
|
|
uint64_t i, n, s, txg;
|
2008-11-20 23:01:55 +03:00
|
|
|
bufwad_t *packbuf, *bigbuf, *pack, *bigH, *bigT;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t packobj, packoff, packsize, bigobj, bigoff, bigsize;
|
|
|
|
uint64_t chunksize = (1000 + ztest_random(1000)) * sizeof (uint64_t);
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t regions = 997;
|
|
|
|
uint64_t stride = 123456789ULL;
|
|
|
|
uint64_t width = 40;
|
|
|
|
int free_percent = 5;
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
uint32_t dmu_read_flags = DMU_READ_PREFETCH;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We will randomly set when to do O_DIRECT on a read.
|
|
|
|
*/
|
|
|
|
if (ztest_random(4) == 0)
|
|
|
|
dmu_read_flags |= DMU_DIRECTIO;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This test uses two objects, packobj and bigobj, that are always
|
|
|
|
* updated together (i.e. in the same tx) so that their contents are
|
|
|
|
* in sync and can be compared. Their contents relate to each other
|
|
|
|
* in a simple way: packobj is a dense array of 'bufwad' structures,
|
|
|
|
* while bigobj is a sparse array of the same bufwads. Specifically,
|
|
|
|
* for any index n, there are three bufwads that should be identical:
|
|
|
|
*
|
|
|
|
* packobj, at offset n * sizeof (bufwad_t)
|
|
|
|
* bigobj, at the head of the nth chunk
|
|
|
|
* bigobj, at the tail of the nth chunk
|
|
|
|
*
|
|
|
|
* The chunk size is arbitrary. It doesn't have to be a power of two,
|
|
|
|
* and it doesn't have any relation to the object blocksize.
|
|
|
|
* The only requirement is that it can hold at least two bufwads.
|
|
|
|
*
|
|
|
|
* Normally, we write the bufwad to each of these locations.
|
|
|
|
* However, free_percent of the time we instead write zeroes to
|
|
|
|
* packobj and perform a dmu_free_range() on bigobj. By comparing
|
|
|
|
* bigobj to packobj, we can verify that the DMU is correctly
|
|
|
|
* tracking which parts of an object are allocated and free,
|
|
|
|
* and that the contents of the allocated blocks are correct.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the directory info. If it's the first time, set things up.
|
|
|
|
*/
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_UINT64_OTHER, 0, 0, chunksize);
|
|
|
|
ztest_od_init(od + 1, id, FTAG, 1, DMU_OT_UINT64_OTHER, 0, 0,
|
|
|
|
chunksize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, size, B_FALSE) != 0) {
|
|
|
|
umem_free(od, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
bigobj = od[0].od_object;
|
|
|
|
packobj = od[1].od_object;
|
|
|
|
chunksize = od[0].od_gen;
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(chunksize, ==, od[1].od_gen);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Prefetch a random chunk of the big object.
|
|
|
|
* Our aim here is to get some async reads in flight
|
|
|
|
* for blocks that we may free below; the DMU should
|
|
|
|
* handle this race correctly.
|
|
|
|
*/
|
|
|
|
n = ztest_random(regions) * stride + ztest_random(width);
|
|
|
|
s = 1 + ztest_random(2 * width - 1);
|
2015-12-22 04:31:57 +03:00
|
|
|
dmu_prefetch(os, bigobj, 0, n * chunksize, s * chunksize,
|
|
|
|
ZIO_PRIORITY_SYNC_READ);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick a random index and compute the offsets into packobj and bigobj.
|
|
|
|
*/
|
|
|
|
n = ztest_random(regions) * stride + ztest_random(width);
|
|
|
|
s = 1 + ztest_random(width - 1);
|
|
|
|
|
|
|
|
packoff = n * sizeof (bufwad_t);
|
|
|
|
packsize = s * sizeof (bufwad_t);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
bigoff = n * chunksize;
|
|
|
|
bigsize = s * chunksize;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
packbuf = umem_alloc(packsize, UMEM_NOFAIL);
|
|
|
|
bigbuf = umem_alloc(bigsize, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* free_percent of the time, free a range of bigobj rather than
|
|
|
|
* overwriting it.
|
|
|
|
*/
|
|
|
|
freeit = (ztest_random(100) < free_percent);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the current contents of our objects.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_read(os, packobj, packoff, packsize, packbuf,
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
dmu_read_flags);
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_read(os, bigobj, bigoff, bigsize, bigbuf,
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
dmu_read_flags);
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Get a tx for the mods to both packobj and bigobj.
|
|
|
|
*/
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_write(tx, packobj, packoff, packsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (freeit)
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_free(tx, bigobj, bigoff, bigsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_write(tx, bigobj, bigoff, bigsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-08-07 22:32:46 +04:00
|
|
|
/* This accounts for setting the checksum/compression. */
|
|
|
|
dmu_tx_hold_bonus(tx, bigobj);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
umem_free(packbuf, packsize);
|
|
|
|
umem_free(bigbuf, bigsize);
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(od, size);
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
enum zio_checksum cksum;
|
|
|
|
do {
|
|
|
|
cksum = (enum zio_checksum)
|
|
|
|
ztest_random_dsl_prop(ZFS_PROP_CHECKSUM);
|
|
|
|
} while (cksum >= ZIO_CHECKSUM_LEGACY_FUNCTIONS);
|
|
|
|
dmu_object_set_checksum(os, bigobj, cksum, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
enum zio_compress comp;
|
|
|
|
do {
|
|
|
|
comp = (enum zio_compress)
|
|
|
|
ztest_random_dsl_prop(ZFS_PROP_COMPRESSION);
|
|
|
|
} while (comp >= ZIO_COMPRESS_LEGACY_FUNCTIONS);
|
|
|
|
dmu_object_set_compress(os, bigobj, comp, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For each index from n to n + s, verify that the existing bufwad
|
|
|
|
* in packobj matches the bufwads at the head and tail of the
|
|
|
|
* corresponding chunk in bigobj. Then update all three bufwads
|
|
|
|
* with the new values we want to write out.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < s; i++) {
|
|
|
|
/* LINTED */
|
|
|
|
pack = (bufwad_t *)((char *)packbuf + i * sizeof (bufwad_t));
|
|
|
|
/* LINTED */
|
2010-05-29 00:45:14 +04:00
|
|
|
bigH = (bufwad_t *)((char *)bigbuf + i * chunksize);
|
2008-11-20 23:01:55 +03:00
|
|
|
/* LINTED */
|
2010-05-29 00:45:14 +04:00
|
|
|
bigT = (bufwad_t *)((char *)bigH + chunksize) - 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U((uintptr_t)bigH - (uintptr_t)bigbuf, <, bigsize);
|
|
|
|
ASSERT3U((uintptr_t)bigT - (uintptr_t)bigbuf, <, bigsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (pack->bw_txg > txg)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE,
|
|
|
|
"future leak: got %"PRIx64", open txg is %"PRIx64"",
|
2008-11-20 23:01:55 +03:00
|
|
|
pack->bw_txg, txg);
|
|
|
|
|
|
|
|
if (pack->bw_data != 0 && pack->bw_index != n + i)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "wrong index: "
|
|
|
|
"got %"PRIx64", wanted %"PRIx64"+%"PRIx64"",
|
2008-11-20 23:01:55 +03:00
|
|
|
pack->bw_index, n, i);
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
if (memcmp(pack, bigH, sizeof (bufwad_t)) != 0)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "pack/bigH mismatch in %p/%p",
|
|
|
|
pack, bigH);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
if (memcmp(pack, bigT, sizeof (bufwad_t)) != 0)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "pack/bigT mismatch in %p/%p",
|
|
|
|
pack, bigT);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (freeit) {
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(pack, 0, sizeof (bufwad_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
pack->bw_index = n + i;
|
|
|
|
pack->bw_txg = txg;
|
|
|
|
pack->bw_data = 1 + ztest_random(-2ULL);
|
|
|
|
}
|
|
|
|
*bigH = *pack;
|
|
|
|
*bigT = *pack;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We've verified all the old bufwads, and made new ones.
|
|
|
|
* Now write them out.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_write(os, packobj, packoff, packsize, packbuf, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (freeit) {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 7) {
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) printf("freeing offset %"PRIx64" size %"PRIx64""
|
|
|
|
" txg %"PRIx64"\n",
|
|
|
|
bigoff, bigsize, txg);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_free_range(os, bigobj, bigoff, bigsize, tx));
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 7) {
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) printf("writing offset %"PRIx64" size %"PRIx64""
|
|
|
|
" txg %"PRIx64"\n",
|
|
|
|
bigoff, bigsize, txg);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_write(os, bigobj, bigoff, bigsize, bigbuf, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sanity check the stuff we just wrote.
|
|
|
|
*/
|
|
|
|
{
|
|
|
|
void *packcheck = umem_alloc(packsize, UMEM_NOFAIL);
|
|
|
|
void *bigcheck = umem_alloc(bigsize, UMEM_NOFAIL);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_read(os, packobj, packoff,
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
packsize, packcheck, dmu_read_flags));
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_read(os, bigobj, bigoff,
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
bigsize, bigcheck, dmu_read_flags));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
ASSERT0(memcmp(packbuf, packcheck, packsize));
|
|
|
|
ASSERT0(memcmp(bigbuf, bigcheck, bigsize));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
umem_free(packcheck, packsize);
|
|
|
|
umem_free(bigcheck, bigsize);
|
|
|
|
}
|
|
|
|
|
|
|
|
umem_free(packbuf, packsize);
|
|
|
|
umem_free(bigbuf, bigsize);
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(od, size);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2009-07-03 02:44:48 +04:00
|
|
|
compare_and_update_pbbufs(uint64_t s, bufwad_t *packbuf, bufwad_t *bigbuf,
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t bigsize, uint64_t n, uint64_t chunksize, uint64_t txg)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
|
|
|
uint64_t i;
|
|
|
|
bufwad_t *pack;
|
|
|
|
bufwad_t *bigH;
|
|
|
|
bufwad_t *bigT;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For each index from n to n + s, verify that the existing bufwad
|
|
|
|
* in packobj matches the bufwads at the head and tail of the
|
|
|
|
* corresponding chunk in bigobj. Then update all three bufwads
|
|
|
|
* with the new values we want to write out.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < s; i++) {
|
|
|
|
/* LINTED */
|
|
|
|
pack = (bufwad_t *)((char *)packbuf + i * sizeof (bufwad_t));
|
|
|
|
/* LINTED */
|
2010-05-29 00:45:14 +04:00
|
|
|
bigH = (bufwad_t *)((char *)bigbuf + i * chunksize);
|
2009-07-03 02:44:48 +04:00
|
|
|
/* LINTED */
|
2010-05-29 00:45:14 +04:00
|
|
|
bigT = (bufwad_t *)((char *)bigH + chunksize) - 1;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U((uintptr_t)bigH - (uintptr_t)bigbuf, <, bigsize);
|
|
|
|
ASSERT3U((uintptr_t)bigT - (uintptr_t)bigbuf, <, bigsize);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
if (pack->bw_txg > txg)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE,
|
|
|
|
"future leak: got %"PRIx64", open txg is %"PRIx64"",
|
2009-07-03 02:44:48 +04:00
|
|
|
pack->bw_txg, txg);
|
|
|
|
|
|
|
|
if (pack->bw_data != 0 && pack->bw_index != n + i)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "wrong index: "
|
|
|
|
"got %"PRIx64", wanted %"PRIx64"+%"PRIx64"",
|
2009-07-03 02:44:48 +04:00
|
|
|
pack->bw_index, n, i);
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
if (memcmp(pack, bigH, sizeof (bufwad_t)) != 0)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "pack/bigH mismatch in %p/%p",
|
|
|
|
pack, bigH);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
if (memcmp(pack, bigT, sizeof (bufwad_t)) != 0)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "pack/bigT mismatch in %p/%p",
|
|
|
|
pack, bigT);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
pack->bw_index = n + i;
|
|
|
|
pack->bw_txg = txg;
|
|
|
|
pack->bw_data = 1 + ztest_random(-2ULL);
|
|
|
|
|
|
|
|
*bigH = *pack;
|
|
|
|
*bigT = *pack;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
#undef OD_ARRAY_SIZE
|
2013-11-01 23:26:11 +04:00
|
|
|
#define OD_ARRAY_SIZE 2
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_read_write_zcopy(ztest_ds_t *zd, uint64_t id)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_tx_t *tx;
|
|
|
|
uint64_t i;
|
|
|
|
int error;
|
2010-08-26 22:13:05 +04:00
|
|
|
int size;
|
2009-07-03 02:44:48 +04:00
|
|
|
uint64_t n, s, txg;
|
|
|
|
bufwad_t *packbuf, *bigbuf;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t packobj, packoff, packsize, bigobj, bigoff, bigsize;
|
|
|
|
uint64_t blocksize = ztest_random_blocksize();
|
|
|
|
uint64_t chunksize = blocksize;
|
2009-07-03 02:44:48 +04:00
|
|
|
uint64_t regions = 997;
|
|
|
|
uint64_t stride = 123456789ULL;
|
|
|
|
uint64_t width = 9;
|
|
|
|
dmu_buf_t *bonus_db;
|
|
|
|
arc_buf_t **bigbuf_arcbufs;
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_object_info_t doi;
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
uint32_t dmu_read_flags = DMU_READ_PREFETCH;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We will randomly set when to do O_DIRECT on a read.
|
|
|
|
*/
|
|
|
|
if (ztest_random(4) == 0)
|
|
|
|
dmu_read_flags |= DMU_DIRECTIO;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
size = sizeof (ztest_od_t) * OD_ARRAY_SIZE;
|
2010-08-26 22:13:05 +04:00
|
|
|
od = umem_alloc(size, UMEM_NOFAIL);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* This test uses two objects, packobj and bigobj, that are always
|
|
|
|
* updated together (i.e. in the same tx) so that their contents are
|
|
|
|
* in sync and can be compared. Their contents relate to each other
|
|
|
|
* in a simple way: packobj is a dense array of 'bufwad' structures,
|
|
|
|
* while bigobj is a sparse array of the same bufwads. Specifically,
|
|
|
|
* for any index n, there are three bufwads that should be identical:
|
|
|
|
*
|
|
|
|
* packobj, at offset n * sizeof (bufwad_t)
|
|
|
|
* bigobj, at the head of the nth chunk
|
|
|
|
* bigobj, at the tail of the nth chunk
|
|
|
|
*
|
|
|
|
* The chunk size is set equal to bigobj block size so that
|
2017-09-28 18:49:13 +03:00
|
|
|
* dmu_assign_arcbuf_by_dbuf() can be tested for object updates.
|
2009-07-03 02:44:48 +04:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the directory info. If it's the first time, set things up.
|
|
|
|
*/
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_UINT64_OTHER, blocksize, 0, 0);
|
|
|
|
ztest_od_init(od + 1, id, FTAG, 1, DMU_OT_UINT64_OTHER, 0, 0,
|
|
|
|
chunksize);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, size, B_FALSE) != 0) {
|
|
|
|
umem_free(od, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
bigobj = od[0].od_object;
|
|
|
|
packobj = od[1].od_object;
|
|
|
|
blocksize = od[0].od_blocksize;
|
|
|
|
chunksize = blocksize;
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(chunksize, ==, od[1].od_gen);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_object_info(os, bigobj, &doi));
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY(ISP2(doi.doi_data_block_size));
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3U(chunksize, ==, doi.doi_data_block_size);
|
|
|
|
VERIFY3U(chunksize, >=, 2 * sizeof (bufwad_t));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick a random index and compute the offsets into packobj and bigobj.
|
|
|
|
*/
|
|
|
|
n = ztest_random(regions) * stride + ztest_random(width);
|
|
|
|
s = 1 + ztest_random(width - 1);
|
|
|
|
|
|
|
|
packoff = n * sizeof (bufwad_t);
|
|
|
|
packsize = s * sizeof (bufwad_t);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
bigoff = n * chunksize;
|
|
|
|
bigsize = s * chunksize;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
packbuf = umem_zalloc(packsize, UMEM_NOFAIL);
|
|
|
|
bigbuf = umem_zalloc(bigsize, UMEM_NOFAIL);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_bonus_hold(os, bigobj, FTAG, &bonus_db));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
bigbuf_arcbufs = umem_zalloc(2 * s * sizeof (arc_buf_t *), UMEM_NOFAIL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Iteration 0 test zcopy for DB_UNCACHED dbufs.
|
|
|
|
* Iteration 1 test zcopy to already referenced dbufs.
|
|
|
|
* Iteration 2 test zcopy to dirty dbuf in the same txg.
|
|
|
|
* Iteration 3 test zcopy to dbuf dirty in previous txg.
|
|
|
|
* Iteration 4 test zcopy when dbuf is no longer dirty.
|
|
|
|
* Iteration 5 test zcopy when it can't be done.
|
|
|
|
* Iteration 6 one more zcopy write.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < 7; i++) {
|
|
|
|
uint64_t j;
|
|
|
|
uint64_t off;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In iteration 5 (i == 5) use arcbufs
|
|
|
|
* that don't match bigobj blksz to test
|
2017-09-28 18:49:13 +03:00
|
|
|
* dmu_assign_arcbuf_by_dbuf() when it can't directly
|
2009-07-03 02:44:48 +04:00
|
|
|
* assign an arcbuf to a dbuf.
|
|
|
|
*/
|
|
|
|
for (j = 0; j < s; j++) {
|
2014-12-30 06:12:23 +03:00
|
|
|
if (i != 5 || chunksize < (SPA_MINBLOCKSIZE * 2)) {
|
2009-07-03 02:44:48 +04:00
|
|
|
bigbuf_arcbufs[j] =
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_request_arcbuf(bonus_db, chunksize);
|
2009-07-03 02:44:48 +04:00
|
|
|
} else {
|
|
|
|
bigbuf_arcbufs[2 * j] =
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_request_arcbuf(bonus_db, chunksize / 2);
|
2009-07-03 02:44:48 +04:00
|
|
|
bigbuf_arcbufs[2 * j + 1] =
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_request_arcbuf(bonus_db, chunksize / 2);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get a tx for the mods to both packobj and bigobj.
|
|
|
|
*/
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_write(tx, packobj, packoff, packsize);
|
|
|
|
dmu_tx_hold_write(tx, bigobj, bigoff, bigsize);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0) {
|
2009-07-03 02:44:48 +04:00
|
|
|
umem_free(packbuf, packsize);
|
|
|
|
umem_free(bigbuf, bigsize);
|
|
|
|
for (j = 0; j < s; j++) {
|
2014-12-30 06:12:23 +03:00
|
|
|
if (i != 5 ||
|
|
|
|
chunksize < (SPA_MINBLOCKSIZE * 2)) {
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_return_arcbuf(bigbuf_arcbufs[j]);
|
|
|
|
} else {
|
|
|
|
dmu_return_arcbuf(
|
|
|
|
bigbuf_arcbufs[2 * j]);
|
|
|
|
dmu_return_arcbuf(
|
|
|
|
bigbuf_arcbufs[2 * j + 1]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
umem_free(bigbuf_arcbufs, 2 * s * sizeof (arc_buf_t *));
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(od, size);
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_buf_rele(bonus_db, FTAG);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 50% of the time don't read objects in the 1st iteration to
|
2017-09-28 18:49:13 +03:00
|
|
|
* test dmu_assign_arcbuf_by_dbuf() for the case when there are
|
|
|
|
* no existing dbufs for the specified offsets.
|
2009-07-03 02:44:48 +04:00
|
|
|
*/
|
|
|
|
if (i != 0 || ztest_random(2) != 0) {
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_read(os, packobj, packoff,
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
packsize, packbuf, dmu_read_flags);
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_read(os, bigobj, bigoff, bigsize,
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
bigbuf, dmu_read_flags);
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
compare_and_update_pbbufs(s, packbuf, bigbuf, bigsize,
|
2010-05-29 00:45:14 +04:00
|
|
|
n, chunksize, txg);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We've verified all the old bufwads, and made new ones.
|
|
|
|
* Now write them out.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_write(os, packobj, packoff, packsize, packbuf, tx);
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 7) {
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) printf("writing offset %"PRIx64" size %"PRIx64""
|
|
|
|
" txg %"PRIx64"\n",
|
|
|
|
bigoff, bigsize, txg);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
for (off = bigoff, j = 0; j < s; j++, off += chunksize) {
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_buf_t *dbt;
|
2014-12-30 06:12:23 +03:00
|
|
|
if (i != 5 || chunksize < (SPA_MINBLOCKSIZE * 2)) {
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(bigbuf_arcbufs[j]->b_data,
|
|
|
|
(caddr_t)bigbuf + (off - bigoff),
|
|
|
|
chunksize);
|
2009-07-03 02:44:48 +04:00
|
|
|
} else {
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(bigbuf_arcbufs[2 * j]->b_data,
|
|
|
|
(caddr_t)bigbuf + (off - bigoff),
|
2010-05-29 00:45:14 +04:00
|
|
|
chunksize / 2);
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(bigbuf_arcbufs[2 * j + 1]->b_data,
|
|
|
|
(caddr_t)bigbuf + (off - bigoff) +
|
2010-05-29 00:45:14 +04:00
|
|
|
chunksize / 2,
|
|
|
|
chunksize / 2);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (i == 1) {
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY(dmu_buf_hold(os, bigobj, off,
|
|
|
|
FTAG, &dbt, DMU_READ_NO_PREFETCH) == 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
if (i != 5 || chunksize < (SPA_MINBLOCKSIZE * 2)) {
|
2019-01-18 02:47:08 +03:00
|
|
|
VERIFY0(dmu_assign_arcbuf_by_dbuf(bonus_db,
|
|
|
|
off, bigbuf_arcbufs[j], tx));
|
2009-07-03 02:44:48 +04:00
|
|
|
} else {
|
2019-01-18 02:47:08 +03:00
|
|
|
VERIFY0(dmu_assign_arcbuf_by_dbuf(bonus_db,
|
|
|
|
off, bigbuf_arcbufs[2 * j], tx));
|
|
|
|
VERIFY0(dmu_assign_arcbuf_by_dbuf(bonus_db,
|
2010-05-29 00:45:14 +04:00
|
|
|
off + chunksize / 2,
|
2019-01-18 02:47:08 +03:00
|
|
|
bigbuf_arcbufs[2 * j + 1], tx));
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
if (i == 1) {
|
|
|
|
dmu_buf_rele(dbt, FTAG);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sanity check the stuff we just wrote.
|
|
|
|
*/
|
|
|
|
{
|
|
|
|
void *packcheck = umem_alloc(packsize, UMEM_NOFAIL);
|
|
|
|
void *bigcheck = umem_alloc(bigsize, UMEM_NOFAIL);
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
VERIFY0(dmu_read(os, packobj, packoff,
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
packsize, packcheck, dmu_read_flags));
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
VERIFY0(dmu_read(os, bigobj, bigoff,
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
bigsize, bigcheck, dmu_read_flags));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
ASSERT0(memcmp(packbuf, packcheck, packsize));
|
|
|
|
ASSERT0(memcmp(bigbuf, bigcheck, bigsize));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
umem_free(packcheck, packsize);
|
|
|
|
umem_free(bigcheck, bigsize);
|
|
|
|
}
|
|
|
|
if (i == 2) {
|
2019-03-29 19:13:20 +03:00
|
|
|
txg_wait_open(dmu_objset_pool(os), 0, B_TRUE);
|
2009-07-03 02:44:48 +04:00
|
|
|
} else if (i == 3) {
|
|
|
|
txg_wait_synced(dmu_objset_pool(os), 0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_buf_rele(bonus_db, FTAG);
|
|
|
|
umem_free(packbuf, packsize);
|
|
|
|
umem_free(bigbuf, bigsize);
|
|
|
|
umem_free(bigbuf_arcbufs, 2 * s * sizeof (arc_buf_t *));
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(od, size);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_write_parallel(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) id;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t offset = (1ULL << (ztest_random(20) + 43)) +
|
|
|
|
(ztest_random(ZTEST_RANGE_LOCKS) << SPA_MAXBLOCKSHIFT);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Have multiple threads write to large offsets in an object
|
|
|
|
* to verify that parallel writes to an object -- even to the
|
|
|
|
* same blocks within the object -- doesn't cause any trouble.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, ID_PARALLEL, FTAG, 0, DMU_OT_UINT64_OTHER, 0, 0, 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t), B_FALSE) != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
while (ztest_random(10) != 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_io(zd, od->od_object, offset);
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
void
|
|
|
|
ztest_dmu_prealloc(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t offset = (1ULL << (ztest_random(4) + SPA_MAXBLOCKSHIFT)) +
|
|
|
|
(ztest_random(ZTEST_RANGE_LOCKS) << SPA_MAXBLOCKSHIFT);
|
|
|
|
uint64_t count = ztest_random(20) + 1;
|
|
|
|
uint64_t blocksize = ztest_random_blocksize();
|
|
|
|
void *data;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_UINT64_OTHER, blocksize, 0, 0);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t),
|
|
|
|
!ztest_random(2)) != 0) {
|
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_truncate(zd, od->od_object, offset, count * blocksize) != 0) {
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_prealloc(zd, od->od_object, offset, count * blocksize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
data = umem_zalloc(blocksize, UMEM_NOFAIL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
while (ztest_random(count) != 0) {
|
|
|
|
uint64_t randoff = offset + (ztest_random(count) * blocksize);
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_write(zd, od->od_object, randoff, blocksize,
|
2010-05-29 00:45:14 +04:00
|
|
|
data) != 0)
|
|
|
|
break;
|
|
|
|
while (ztest_random(4) != 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_io(zd, od->od_object, randoff);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
umem_free(data, blocksize);
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that zap_{create,destroy,add,remove,update} work as expected.
|
|
|
|
*/
|
|
|
|
#define ZTEST_ZAP_MIN_INTS 1
|
|
|
|
#define ZTEST_ZAP_MAX_INTS 4
|
|
|
|
#define ZTEST_ZAP_MAX_PROPS 1000
|
|
|
|
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_zap(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t object;
|
|
|
|
uint64_t txg, last_txg;
|
|
|
|
uint64_t value[ZTEST_ZAP_MAX_INTS];
|
|
|
|
uint64_t zl_ints, zl_intsize, prop;
|
|
|
|
int i, ints;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
char propname[100], txgname[100];
|
|
|
|
int error;
|
2022-04-19 21:38:30 +03:00
|
|
|
const char *const hc[2] = { "s.acl.h", ".s.open.h.hyLZlg" };
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_ZAP_OTHER, 0, 0, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t),
|
2016-12-12 21:46:26 +03:00
|
|
|
!ztest_random(2)) != 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
object = od->od_object;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Generate a known hash collision, and verify that
|
|
|
|
* we can lookup and remove both entries.
|
|
|
|
*/
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
dmu_tx_hold_zap(tx, object, B_TRUE, NULL);
|
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2010-05-29 00:45:14 +04:00
|
|
|
for (i = 0; i < 2; i++) {
|
|
|
|
value[i] = i;
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_add(os, object, hc[i], sizeof (uint64_t),
|
2010-05-29 00:45:14 +04:00
|
|
|
1, &value[i], tx));
|
|
|
|
}
|
|
|
|
for (i = 0; i < 2; i++) {
|
|
|
|
VERIFY3U(EEXIST, ==, zap_add(os, object, hc[i],
|
|
|
|
sizeof (uint64_t), 1, &value[i], tx));
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(
|
2010-05-29 00:45:14 +04:00
|
|
|
zap_length(os, object, hc[i], &zl_intsize, &zl_ints));
|
|
|
|
ASSERT3U(zl_intsize, ==, sizeof (uint64_t));
|
|
|
|
ASSERT3U(zl_ints, ==, 1);
|
|
|
|
}
|
|
|
|
for (i = 0; i < 2; i++) {
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_remove(os, object, hc[i], tx));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_commit(tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2019-01-13 21:11:52 +03:00
|
|
|
* Generate a bunch of random entries.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
ints = MAX(ZTEST_ZAP_MIN_INTS, object % ZTEST_ZAP_MAX_INTS);
|
|
|
|
|
|
|
|
prop = ztest_random(ZTEST_ZAP_MAX_PROPS);
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) sprintf(propname, "prop_%"PRIu64"", prop);
|
|
|
|
(void) sprintf(txgname, "txg_%"PRIu64"", prop);
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(value, 0, sizeof (value));
|
2008-11-20 23:01:55 +03:00
|
|
|
last_txg = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If these zap entries already exist, validate their contents.
|
|
|
|
*/
|
|
|
|
error = zap_length(os, object, txgname, &zl_intsize, &zl_ints);
|
|
|
|
if (error == 0) {
|
|
|
|
ASSERT3U(zl_intsize, ==, sizeof (uint64_t));
|
|
|
|
ASSERT3U(zl_ints, ==, 1);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_lookup(os, object, txgname, zl_intsize,
|
|
|
|
zl_ints, &last_txg));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_length(os, object, propname, &zl_intsize,
|
|
|
|
&zl_ints));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT3U(zl_intsize, ==, sizeof (uint64_t));
|
|
|
|
ASSERT3U(zl_ints, ==, ints);
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_lookup(os, object, propname, zl_intsize,
|
|
|
|
zl_ints, value));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (i = 0; i < ints; i++) {
|
|
|
|
ASSERT3U(value[i], ==, last_txg + object + i);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
ASSERT3U(error, ==, ENOENT);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Atomically update two entries in our zap object.
|
|
|
|
* The first is named txg_%llu, and contains the txg
|
|
|
|
* in which the property was last updated. The second
|
|
|
|
* is named prop_%llu, and the nth element of its value
|
|
|
|
* should be txg + object + n.
|
|
|
|
*/
|
|
|
|
tx = dmu_tx_create(os);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_zap(tx, object, B_TRUE, NULL);
|
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (last_txg > txg)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "zap future leak: old %"PRIu64" new %"PRIu64"",
|
|
|
|
last_txg, txg);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (i = 0; i < ints; i++)
|
|
|
|
value[i] = txg + object + i;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_update(os, object, txgname, sizeof (uint64_t),
|
2010-05-29 00:45:14 +04:00
|
|
|
1, &txg, tx));
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_update(os, object, propname, sizeof (uint64_t),
|
2010-05-29 00:45:14 +04:00
|
|
|
ints, value, tx));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove a random pair of entries.
|
|
|
|
*/
|
|
|
|
prop = ztest_random(ZTEST_ZAP_MAX_PROPS);
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) sprintf(propname, "prop_%"PRIu64"", prop);
|
|
|
|
(void) sprintf(txgname, "txg_%"PRIu64"", prop);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
error = zap_length(os, object, txgname, &zl_intsize, &zl_ints);
|
|
|
|
|
|
|
|
if (error == ENOENT)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_zap(tx, object, B_TRUE, NULL);
|
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_remove(os, object, txgname, tx));
|
|
|
|
VERIFY0(zap_remove(os, object, propname, tx));
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_commit(tx);
|
2010-08-26 22:13:05 +04:00
|
|
|
out:
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2019-01-13 21:11:52 +03:00
|
|
|
* Test case to test the upgrading of a microzap to fatzap.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_fzap(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
objset_t *os = zd->zd_os;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2021-06-05 15:18:37 +03:00
|
|
|
uint64_t object, txg, value;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_ZAP_OTHER, 0, 0, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t),
|
2016-12-12 21:46:26 +03:00
|
|
|
!ztest_random(2)) != 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
|
|
|
object = od->od_object;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Add entries to this ZAP and make sure it spills over
|
|
|
|
* and gets upgraded to a fatzap. Also, since we are adding
|
|
|
|
* 2050 entries we should see ptrtbl growth and leaf-block split.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2021-06-05 15:18:37 +03:00
|
|
|
for (value = 0; value < 2050; value++) {
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_t *tx;
|
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) snprintf(name, sizeof (name), "fzap-%"PRIu64"-%"PRIu64"",
|
|
|
|
id, value);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
dmu_tx_hold_zap(tx, object, B_TRUE, name);
|
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
|
|
|
if (txg == 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2010-05-29 00:45:14 +04:00
|
|
|
error = zap_add(os, object, name, sizeof (uint64_t), 1,
|
|
|
|
&value, tx);
|
|
|
|
ASSERT(error == 0 || error == EEXIST);
|
|
|
|
dmu_tx_commit(tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-08-26 22:13:05 +04:00
|
|
|
out:
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_zap_parallel(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) id;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t txg, object, count, wsize, wc, zl_wsize, zl_wc;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
int i, namelen, error;
|
2010-05-29 00:45:14 +04:00
|
|
|
int micro = ztest_random(2);
|
2008-11-20 23:01:55 +03:00
|
|
|
char name[20], string_value[20];
|
|
|
|
void *data;
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, ID_PARALLEL, FTAG, micro, DMU_OT_ZAP_OTHER, 0, 0, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t), B_FALSE) != 0) {
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
object = od->od_object;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Generate a random name of the form 'xxx.....' where each
|
|
|
|
* x is a random printable character and the dots are dots.
|
|
|
|
* There are 94 such characters, and the name length goes from
|
|
|
|
* 6 to 20, so there are 94^3 * 15 = 12,458,760 possible names.
|
|
|
|
*/
|
|
|
|
namelen = ztest_random(sizeof (name) - 5) + 5 + 1;
|
|
|
|
|
|
|
|
for (i = 0; i < 3; i++)
|
|
|
|
name[i] = '!' + ztest_random('~' - '!' + 1);
|
|
|
|
for (; i < namelen - 1; i++)
|
|
|
|
name[i] = '.';
|
|
|
|
name[i] = '\0';
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((namelen & 1) || micro) {
|
2008-11-20 23:01:55 +03:00
|
|
|
wsize = sizeof (txg);
|
|
|
|
wc = 1;
|
|
|
|
data = &txg;
|
|
|
|
} else {
|
|
|
|
wsize = 1;
|
|
|
|
wc = namelen;
|
|
|
|
data = string_value;
|
|
|
|
}
|
|
|
|
|
|
|
|
count = -1ULL;
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY0(zap_count(os, object, &count));
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3S(count, !=, -1ULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Select an operation: length, lookup, add, update, remove.
|
|
|
|
*/
|
|
|
|
i = ztest_random(5);
|
|
|
|
|
|
|
|
if (i >= 2) {
|
|
|
|
tx = dmu_tx_create(os);
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_zap(tx, object, B_TRUE, NULL);
|
|
|
|
txg = ztest_tx_assign(tx, TXG_MIGHTWAIT, FTAG);
|
2016-07-28 23:10:05 +03:00
|
|
|
if (txg == 0) {
|
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
2016-07-28 23:10:05 +03:00
|
|
|
}
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(string_value, name, namelen);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
tx = NULL;
|
|
|
|
txg = 0;
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(string_value, 0, namelen);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
switch (i) {
|
|
|
|
|
|
|
|
case 0:
|
|
|
|
error = zap_length(os, object, name, &zl_wsize, &zl_wc);
|
|
|
|
if (error == 0) {
|
|
|
|
ASSERT3U(wsize, ==, zl_wsize);
|
|
|
|
ASSERT3U(wc, ==, zl_wc);
|
|
|
|
} else {
|
|
|
|
ASSERT3U(error, ==, ENOENT);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
case 1:
|
|
|
|
error = zap_lookup(os, object, name, wsize, wc, data);
|
|
|
|
if (error == 0) {
|
|
|
|
if (data == string_value &&
|
2022-02-25 16:26:54 +03:00
|
|
|
memcmp(name, data, namelen) != 0)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "name '%s' != val '%s' len %d",
|
|
|
|
name, (char *)data, namelen);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
ASSERT3U(error, ==, ENOENT);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
case 2:
|
|
|
|
error = zap_add(os, object, name, wsize, wc, data, tx);
|
|
|
|
ASSERT(error == 0 || error == EEXIST);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case 3:
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_update(os, object, name, wsize, wc, data, tx));
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
|
|
|
|
case 4:
|
|
|
|
error = zap_remove(os, object, name, tx);
|
|
|
|
ASSERT(error == 0 || error == ENOENT);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (tx != NULL)
|
|
|
|
dmu_tx_commit(tx);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Commit callback data.
|
|
|
|
*/
|
|
|
|
typedef struct ztest_cb_data {
|
|
|
|
list_node_t zcd_node;
|
|
|
|
uint64_t zcd_txg;
|
|
|
|
int zcd_expected_err;
|
|
|
|
boolean_t zcd_added;
|
|
|
|
boolean_t zcd_called;
|
|
|
|
spa_t *zcd_spa;
|
|
|
|
} ztest_cb_data_t;
|
|
|
|
|
|
|
|
/* This is the actual commit callback function */
|
|
|
|
static void
|
|
|
|
ztest_commit_callback(void *arg, int error)
|
|
|
|
{
|
|
|
|
ztest_cb_data_t *data = arg;
|
|
|
|
uint64_t synced_txg;
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3P(data, !=, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3S(data->zcd_expected_err, ==, error);
|
|
|
|
VERIFY(!data->zcd_called);
|
|
|
|
|
|
|
|
synced_txg = spa_last_synced_txg(data->zcd_spa);
|
|
|
|
if (data->zcd_txg > synced_txg)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE,
|
|
|
|
"commit callback of txg %"PRIu64" called prematurely, "
|
|
|
|
"last synced txg = %"PRIu64"\n",
|
|
|
|
data->zcd_txg, synced_txg);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
data->zcd_called = B_TRUE;
|
|
|
|
|
|
|
|
if (error == ECANCELED) {
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(data->zcd_txg);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(!data->zcd_added);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The private callback data should be destroyed here, but
|
|
|
|
* since we are going to check the zcd_called field after
|
|
|
|
* dmu_tx_abort(), we will destroy it there.
|
|
|
|
*/
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-08-26 21:17:18 +04:00
|
|
|
ASSERT(data->zcd_added);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT3U(data->zcd_txg, !=, 0);
|
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) mutex_enter(&zcl.zcl_callbacks_lock);
|
2010-08-26 21:17:18 +04:00
|
|
|
|
|
|
|
/* See if this cb was called more quickly */
|
|
|
|
if ((synced_txg - data->zcd_txg) < zc_min_txg_delay)
|
|
|
|
zc_min_txg_delay = synced_txg - data->zcd_txg;
|
|
|
|
|
|
|
|
/* Remove our callback from the list */
|
2010-05-29 00:45:14 +04:00
|
|
|
list_remove(&zcl.zcl_callbacks, data);
|
2010-08-26 21:17:18 +04:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) mutex_exit(&zcl.zcl_callbacks_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
umem_free(data, sizeof (ztest_cb_data_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Allocate and initialize callback data structure */
|
|
|
|
static ztest_cb_data_t *
|
|
|
|
ztest_create_cb_data(objset_t *os, uint64_t txg)
|
|
|
|
{
|
|
|
|
ztest_cb_data_t *cb_data;
|
|
|
|
|
|
|
|
cb_data = umem_zalloc(sizeof (ztest_cb_data_t), UMEM_NOFAIL);
|
|
|
|
|
|
|
|
cb_data->zcd_txg = txg;
|
|
|
|
cb_data->zcd_spa = dmu_objset_spa(os);
|
2010-08-26 21:43:27 +04:00
|
|
|
list_link_init(&cb_data->zcd_node);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
return (cb_data);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commit callback test.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dmu_commit_callbacks(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
objset_t *os = zd->zd_os;
|
2010-08-26 22:13:05 +04:00
|
|
|
ztest_od_t *od;
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_t *tx;
|
|
|
|
ztest_cb_data_t *cb_data[3], *tmp_cb;
|
|
|
|
uint64_t old_txg, txg;
|
2010-08-26 21:17:18 +04:00
|
|
|
int i, error = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
od = umem_alloc(sizeof (ztest_od_t), UMEM_NOFAIL);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
ztest_od_init(od, id, FTAG, 0, DMU_OT_UINT64_OTHER, 0, 0, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
if (ztest_object_init(zd, od, sizeof (ztest_od_t), B_FALSE) != 0) {
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
2010-08-26 22:13:05 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
tx = dmu_tx_create(os);
|
|
|
|
|
|
|
|
cb_data[0] = ztest_create_cb_data(os, 0);
|
|
|
|
dmu_tx_callback_register(tx, ztest_commit_callback, cb_data[0]);
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
dmu_tx_hold_write(tx, od->od_object, 0, sizeof (uint64_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/* Every once in a while, abort the transaction on purpose */
|
|
|
|
if (ztest_random(100) == 0)
|
|
|
|
error = -1;
|
|
|
|
|
|
|
|
if (!error)
|
|
|
|
error = dmu_tx_assign(tx, TXG_NOWAIT);
|
|
|
|
|
|
|
|
txg = error ? 0 : dmu_tx_get_txg(tx);
|
|
|
|
|
|
|
|
cb_data[0]->zcd_txg = txg;
|
|
|
|
cb_data[1] = ztest_create_cb_data(os, txg);
|
|
|
|
dmu_tx_callback_register(tx, ztest_commit_callback, cb_data[1]);
|
|
|
|
|
|
|
|
if (error) {
|
|
|
|
/*
|
|
|
|
* It's not a strict requirement to call the registered
|
|
|
|
* callbacks from inside dmu_tx_abort(), but that's what
|
|
|
|
* it's supposed to happen in the current implementation
|
|
|
|
* so we will check for that.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < 2; i++) {
|
|
|
|
cb_data[i]->zcd_expected_err = ECANCELED;
|
|
|
|
VERIFY(!cb_data[i]->zcd_called);
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_tx_abort(tx);
|
|
|
|
|
|
|
|
for (i = 0; i < 2; i++) {
|
|
|
|
VERIFY(cb_data[i]->zcd_called);
|
|
|
|
umem_free(cb_data[i], sizeof (ztest_cb_data_t));
|
|
|
|
}
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
cb_data[2] = ztest_create_cb_data(os, txg);
|
|
|
|
dmu_tx_callback_register(tx, ztest_commit_callback, cb_data[2]);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read existing data to make sure there isn't a future leak.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(dmu_read(os, od->od_object, 0, sizeof (uint64_t),
|
2010-05-29 00:45:14 +04:00
|
|
|
&old_txg, DMU_READ_PREFETCH));
|
|
|
|
|
|
|
|
if (old_txg > txg)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE,
|
|
|
|
"future leak: got %"PRIu64", open txg is %"PRIu64"",
|
2010-05-29 00:45:14 +04:00
|
|
|
old_txg, txg);
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
dmu_write(os, od->od_object, 0, sizeof (uint64_t), &txg, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) mutex_enter(&zcl.zcl_callbacks_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Since commit callbacks don't have any ordering requirement and since
|
|
|
|
* it is theoretically possible for a commit callback to be called
|
|
|
|
* after an arbitrary amount of time has elapsed since its txg has been
|
|
|
|
* synced, it is difficult to reliably determine whether a commit
|
|
|
|
* callback hasn't been called due to high load or due to a flawed
|
|
|
|
* implementation.
|
|
|
|
*
|
|
|
|
* In practice, we will assume that if after a certain number of txgs a
|
|
|
|
* commit callback hasn't been called, then most likely there's an
|
|
|
|
* implementation bug..
|
|
|
|
*/
|
|
|
|
tmp_cb = list_head(&zcl.zcl_callbacks);
|
|
|
|
if (tmp_cb != NULL &&
|
2010-08-26 21:17:18 +04:00
|
|
|
tmp_cb->zcd_txg + ZTEST_COMMIT_CB_THRESH < txg) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE,
|
|
|
|
"Commit callback threshold exceeded, "
|
|
|
|
"oldest txg: %"PRIu64", open txg: %"PRIu64"\n",
|
|
|
|
tmp_cb->zcd_txg, txg);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Let's find the place to insert our callbacks.
|
|
|
|
*
|
|
|
|
* Even though the list is ordered by txg, it is possible for the
|
|
|
|
* insertion point to not be the end because our txg may already be
|
|
|
|
* quiescing at this point and other callbacks in the open txg
|
|
|
|
* (from other objsets) may have sneaked in.
|
|
|
|
*/
|
|
|
|
tmp_cb = list_tail(&zcl.zcl_callbacks);
|
|
|
|
while (tmp_cb != NULL && tmp_cb->zcd_txg > txg)
|
|
|
|
tmp_cb = list_prev(&zcl.zcl_callbacks, tmp_cb);
|
|
|
|
|
|
|
|
/* Add the 3 callbacks to the list */
|
|
|
|
for (i = 0; i < 3; i++) {
|
|
|
|
if (tmp_cb == NULL)
|
|
|
|
list_insert_head(&zcl.zcl_callbacks, cb_data[i]);
|
|
|
|
else
|
|
|
|
list_insert_after(&zcl.zcl_callbacks, tmp_cb,
|
|
|
|
cb_data[i]);
|
|
|
|
|
|
|
|
cb_data[i]->zcd_added = B_TRUE;
|
|
|
|
VERIFY(!cb_data[i]->zcd_called);
|
|
|
|
|
|
|
|
tmp_cb = cb_data[i];
|
|
|
|
}
|
|
|
|
|
2010-08-26 21:17:18 +04:00
|
|
|
zc_cb_counter += 3;
|
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
(void) mutex_exit(&zcl.zcl_callbacks_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
2010-08-26 22:13:05 +04:00
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
umem_free(od, sizeof (ztest_od_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
/*
|
|
|
|
* Visit each object in the dataset. Verify that its properties
|
|
|
|
* are consistent what was stored in the block tag when it was created,
|
|
|
|
* and that its unused bonus buffer space has not been overwritten.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_verify_dnode_bt(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) id;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
uint64_t obj;
|
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
for (obj = 0; err == 0; err = dmu_object_next(os, &obj, FALSE, 0)) {
|
|
|
|
ztest_block_tag_t *bt = NULL;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_object_lock(zd, obj, ZTRL_READER);
|
2017-12-07 02:30:15 +03:00
|
|
|
if (dmu_bonus_hold(os, obj, FTAG, &db) != 0) {
|
|
|
|
ztest_object_unlock(zd, obj);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
continue;
|
2017-12-07 02:30:15 +03:00
|
|
|
}
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
|
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
if (doi.doi_bonus_size >= sizeof (*bt))
|
|
|
|
bt = ztest_bt_bonus(db);
|
|
|
|
|
|
|
|
if (bt && bt->bt_magic == BT_MAGIC) {
|
|
|
|
ztest_bt_verify(bt, os, obj, doi.doi_dnodesize,
|
|
|
|
bt->bt_offset, bt->bt_gen, bt->bt_txg,
|
|
|
|
bt->bt_crtxg);
|
|
|
|
ztest_verify_unused_bonus(db, bt, obj, os, bt->bt_gen);
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_buf_rele(db, FTAG);
|
2017-12-07 02:30:15 +03:00
|
|
|
ztest_object_unlock(zd, obj);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
void
|
|
|
|
ztest_dsl_prop_get_set(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) id;
|
2010-05-29 00:45:14 +04:00
|
|
|
zfs_prop_t proplist[] = {
|
|
|
|
ZFS_PROP_CHECKSUM,
|
|
|
|
ZFS_PROP_COMPRESSION,
|
|
|
|
ZFS_PROP_COPIES,
|
|
|
|
ZFS_PROP_DEDUP
|
|
|
|
};
|
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2023-01-05 03:23:28 +03:00
|
|
|
for (int p = 0; p < sizeof (proplist) / sizeof (proplist[0]); p++) {
|
|
|
|
int error = ztest_dsl_prop_set_uint64(zd->zd_name, proplist[p],
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_random_dsl_prop(proplist[p]), (int)ztest_random(2));
|
2023-01-05 03:23:28 +03:00
|
|
|
ASSERT(error == 0 || error == ENOSPC);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2023-01-05 03:23:28 +03:00
|
|
|
int error = ztest_dsl_prop_set_uint64(zd->zd_name, ZFS_PROP_RECORDSIZE,
|
|
|
|
ztest_random_blocksize(), (int)ztest_random(2));
|
|
|
|
ASSERT(error == 0 || error == ENOSPC);
|
2015-01-29 23:45:40 +03:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
ztest_spa_prop_get_set(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2019-03-29 19:13:20 +03:00
|
|
|
(void) ztest_spa_prop_set_uint64(ZPOOL_PROP_AUTOTRIM, ztest_random(2));
|
|
|
|
|
2024-09-06 18:45:58 +03:00
|
|
|
nvlist_t *props = fnvlist_alloc();
|
|
|
|
|
|
|
|
VERIFY0(spa_prop_get(ztest_spa, props));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6)
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_nvlist(props, 4);
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(props);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
static int
|
|
|
|
user_release_one(const char *snapname, const char *holdname)
|
|
|
|
{
|
|
|
|
nvlist_t *snaps, *holds;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
snaps = fnvlist_alloc();
|
|
|
|
holds = fnvlist_alloc();
|
|
|
|
fnvlist_add_boolean(holds, holdname);
|
|
|
|
fnvlist_add_nvlist(snaps, snapname, holds);
|
|
|
|
fnvlist_free(holds);
|
|
|
|
error = dsl_dataset_user_release(snaps, NULL);
|
|
|
|
fnvlist_free(snaps);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Test snapshot hold/release and deferred destroy.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_dmu_snapshot_hold(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int error;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = zd->zd_os;
|
|
|
|
objset_t *origin;
|
|
|
|
char snapname[100];
|
|
|
|
char fullname[100];
|
|
|
|
char clonename[100];
|
|
|
|
char tag[100];
|
2016-06-16 00:28:36 +03:00
|
|
|
char osname[ZFS_MAX_DATASET_NAME_LEN];
|
2013-09-04 16:00:57 +04:00
|
|
|
nvlist_t *holds;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
dmu_objset_name(os, osname);
|
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) snprintf(snapname, sizeof (snapname), "sh1_%"PRIu64"", id);
|
2013-09-04 16:00:57 +04:00
|
|
|
(void) snprintf(fullname, sizeof (fullname), "%s@%s", osname, snapname);
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) snprintf(clonename, sizeof (clonename), "%s/ch1_%"PRIu64"",
|
|
|
|
osname, id);
|
|
|
|
(void) snprintf(tag, sizeof (tag), "tag_%"PRIu64"", id);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Clean up from any previous run.
|
|
|
|
*/
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_head(clonename);
|
|
|
|
if (error != ENOENT)
|
|
|
|
ASSERT0(error);
|
|
|
|
error = user_release_one(fullname, tag);
|
|
|
|
if (error != ESRCH && error != ENOENT)
|
|
|
|
ASSERT0(error);
|
|
|
|
error = dsl_destroy_snapshot(fullname, B_FALSE);
|
|
|
|
if (error != ENOENT)
|
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Create snapshot, clone it, mark snap for deferred destroy,
|
|
|
|
* destroy clone, verify snap was also destroyed.
|
|
|
|
*/
|
2013-08-28 15:45:09 +04:00
|
|
|
error = dmu_objset_snapshot_one(osname, snapname);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc("dmu_objset_snapshot");
|
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_objset_snapshot(%s) = %d", fullname, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dmu_objset_clone(clonename, fullname);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error == ENOSPC) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc("dmu_objset_clone");
|
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_objset_clone(%s) = %d", clonename, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(fullname, B_TRUE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_destroy_snapshot(%s, B_TRUE) = %d",
|
2010-05-29 00:45:14 +04:00
|
|
|
fullname, error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_head(clonename);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_destroy_head(%s) = %d", clonename, error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_objset_hold(fullname, FTAG, &origin);
|
|
|
|
if (error != ENOENT)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_objset_hold(%s) = %d", fullname, error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Create snapshot, add temporary hold, verify that we can't
|
|
|
|
* destroy a held snapshot, mark for deferred destroy,
|
|
|
|
* release hold, verify snapshot was destroyed.
|
|
|
|
*/
|
2013-08-28 15:45:09 +04:00
|
|
|
error = dmu_objset_snapshot_one(osname, snapname);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc("dmu_objset_snapshot");
|
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dmu_objset_snapshot(%s) = %d", fullname, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
holds = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(holds, fullname, tag);
|
|
|
|
error = dsl_dataset_user_hold(holds, 0, NULL);
|
|
|
|
fnvlist_free(holds);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (error == ENOSPC) {
|
|
|
|
ztest_record_enospc("dsl_dataset_user_hold");
|
|
|
|
goto out;
|
|
|
|
} else if (error) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_dataset_user_hold(%s, %s) = %u",
|
2014-06-06 01:19:08 +04:00
|
|
|
fullname, tag, error);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(fullname, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != EBUSY) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_destroy_snapshot(%s, B_FALSE) = %d",
|
2010-05-29 00:45:14 +04:00
|
|
|
fullname, error);
|
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_destroy_snapshot(fullname, B_TRUE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "dsl_destroy_snapshot(%s, B_TRUE) = %d",
|
2010-05-29 00:45:14 +04:00
|
|
|
fullname, error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = user_release_one(fullname, tag);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "user_release_one(%s, %s) = %d",
|
|
|
|
fullname, tag, error);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY3U(dmu_objset_hold(fullname, FTAG, &origin), ==, ENOENT);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
out:
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inject random faults into the on-disk data.
|
|
|
|
*/
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_fault_inject(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2008-11-20 23:01:55 +03:00
|
|
|
int fd;
|
|
|
|
uint64_t offset;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t leaves;
|
2010-08-26 20:52:39 +04:00
|
|
|
uint64_t bad = 0x1990c0ffeedecadeull;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t top, leaf;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
uint64_t raidz_children;
|
2010-08-26 22:13:05 +04:00
|
|
|
char *path0;
|
|
|
|
char *pathrand;
|
2008-11-20 23:01:55 +03:00
|
|
|
size_t fsize;
|
2017-01-26 23:34:29 +03:00
|
|
|
int bshift = SPA_MAXBLOCKSHIFT + 2;
|
2008-11-20 23:01:55 +03:00
|
|
|
int iters = 1000;
|
2010-05-29 00:45:14 +04:00
|
|
|
int maxfaults;
|
|
|
|
int mirror_save;
|
2008-12-03 23:09:06 +03:00
|
|
|
vdev_t *vd0 = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t guid0 = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
boolean_t islog = B_FALSE;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
boolean_t injected = B_FALSE;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
path0 = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
pathrand = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_enter(&ztest_vdev_lock);
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Device removal is in progress, fault injection must be disabled
|
|
|
|
* until it completes and the pool is scrubbed. The fault injection
|
|
|
|
* strategy for damaging blocks does not take in to account evacuated
|
|
|
|
* blocks which may have already been damaged.
|
|
|
|
*/
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
if (ztest_device_removal_active)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The fault injection strategy for damaging blocks cannot be used
|
|
|
|
* if raidz expansion is in progress. The leaves value
|
|
|
|
* (attached raidz children) is variable and strategy for damaging
|
|
|
|
* blocks will corrupt same data blocks on different child vdevs
|
|
|
|
* because of the reflow process.
|
|
|
|
*/
|
|
|
|
if (spa->spa_raidz_expand != NULL)
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
goto out;
|
|
|
|
|
2018-02-05 23:00:26 +03:00
|
|
|
maxfaults = MAXFAULTS(zs);
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_children = ztest_get_raidz_children(spa);
|
|
|
|
leaves = MAX(zs->zs_mirrors, 1) * raidz_children;
|
2010-05-29 00:45:14 +04:00
|
|
|
mirror_save = zs->zs_mirrors;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3U(leaves, >=, 1);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-01-18 20:47:55 +03:00
|
|
|
/*
|
|
|
|
* While ztest is running the number of leaves will not change. This
|
|
|
|
* is critical for the fault injection logic as it determines where
|
|
|
|
* errors can be safely injected such that they are always repairable.
|
|
|
|
*
|
|
|
|
* When restarting ztest a different number of leaves may be requested
|
|
|
|
* which will shift the regions to be damaged. This is fine as long
|
|
|
|
* as the pool has been scrubbed prior to using the new mapping.
|
|
|
|
* Failure to do can result in non-repairable damage being injected.
|
|
|
|
*/
|
|
|
|
if (ztest_pool_scrubbed == B_FALSE)
|
|
|
|
goto out;
|
|
|
|
|
2013-08-07 22:24:34 +04:00
|
|
|
/*
|
|
|
|
* Grab the name lock as reader. There are some operations
|
|
|
|
* which don't like to have their vdevs changed while
|
|
|
|
* they are in progress (i.e. spa_change_guid). Those
|
|
|
|
* operations will have grabbed the name lock as writer.
|
|
|
|
*/
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2013-08-07 22:24:34 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* We need SCL_STATE here because we're going to look at vd0->vdev_tsd.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (ztest_random(2) == 0) {
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Inject errors on a normal data device or slog device.
|
2008-12-03 23:09:06 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
top = ztest_random_vdev_top(spa, B_TRUE);
|
|
|
|
leaf = ztest_random(leaves) + zs->zs_splits;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Generate paths to the first leaf in this top-level vdev,
|
|
|
|
* and to the random leaf we selected. We'll induce transient
|
|
|
|
* write failures and random online/offline activity on leaf 0,
|
|
|
|
* and we'll write random garbage to the randomly chosen leaf.
|
|
|
|
*/
|
Fix ztest vdev file paths.
Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:
snprintf(path, sizeof (path), ztest_dev_template, ...);
This worked fine until 40b84e7aec6392187722e61e5a4a853b530bf60f, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.
As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".
This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
2012-10-01 18:34:52 +04:00
|
|
|
(void) snprintf(path0, MAXPATHLEN, ztest_dev_template,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_dir, ztest_opts.zo_pool,
|
|
|
|
top * leaves + zs->zs_splits);
|
Fix ztest vdev file paths.
Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:
snprintf(path, sizeof (path), ztest_dev_template, ...);
This worked fine until 40b84e7aec6392187722e61e5a4a853b530bf60f, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.
As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".
This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
2012-10-01 18:34:52 +04:00
|
|
|
(void) snprintf(pathrand, MAXPATHLEN, ztest_dev_template,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_dir, ztest_opts.zo_pool,
|
|
|
|
top * leaves + leaf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
vd0 = vdev_lookup_by_path(spa->spa_root_vdev, path0);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (vd0 != NULL && vd0->vdev_top->vdev_islog)
|
|
|
|
islog = B_TRUE;
|
|
|
|
|
2013-08-07 22:24:34 +04:00
|
|
|
/*
|
|
|
|
* If the top-level vdev needs to be resilvered
|
|
|
|
* then we only allow faults on the device that is
|
|
|
|
* resilvering.
|
|
|
|
*/
|
|
|
|
if (vd0 != NULL && maxfaults != 1 &&
|
|
|
|
(!vdev_resilver_needed(vd0->vdev_top, NULL, NULL) ||
|
2013-08-08 00:16:22 +04:00
|
|
|
vd0->vdev_resilver_txg != 0)) {
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Make vd0 explicitly claim to be unreadable,
|
2021-04-03 04:38:53 +03:00
|
|
|
* or unwritable, or reach behind its back
|
2008-12-03 23:09:06 +03:00
|
|
|
* and close the underlying fd. We can do this if
|
|
|
|
* maxfaults == 0 because we'll fail and reexecute,
|
|
|
|
* and we can do it if maxfaults >= 2 because we'll
|
|
|
|
* have enough redundancy. If maxfaults == 1, the
|
|
|
|
* combination of this with injection of random data
|
|
|
|
* corruption below exceeds the pool's fault tolerance.
|
|
|
|
*/
|
|
|
|
vdev_file_t *vf = vd0->vdev_tsd;
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
zfs_dbgmsg("injecting fault to vdev %llu; maxfaults=%d",
|
|
|
|
(long long)vd0->vdev_id, (int)maxfaults);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (vf != NULL && ztest_random(3) == 0) {
|
2019-11-21 20:32:57 +03:00
|
|
|
(void) close(vf->vf_file->f_fd);
|
|
|
|
vf->vf_file->f_fd = -1;
|
2008-12-03 23:09:06 +03:00
|
|
|
} else if (ztest_random(2) == 0) {
|
|
|
|
vd0->vdev_cant_read = B_TRUE;
|
|
|
|
} else {
|
|
|
|
vd0->vdev_cant_write = B_TRUE;
|
|
|
|
}
|
|
|
|
guid0 = vd0->vdev_guid;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Inject errors on an l2cache device.
|
|
|
|
*/
|
|
|
|
spa_aux_vdev_t *sav = &spa->spa_l2cache;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (sav->sav_count == 0) {
|
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
vd0 = sav->sav_vdevs[ztest_random(sav->sav_count)];
|
2008-11-20 23:01:55 +03:00
|
|
|
guid0 = vd0->vdev_guid;
|
Fix unsafe string operations
Coverity caught unsafe use of `strcpy()` in `ztest_dmu_objset_own()`,
`nfs_init_tmpfile()` and `dump_snapshot()`. It also caught an unsafe use
of `strlcat()` in `nfs_init_tmpfile()`.
Inspired by this, I did an audit of every single usage of `strcpy()` and
`strcat()` in the code. If I could not prove that the usage was safe, I
changed the code to use either `strlcpy()` or `strlcat()`, depending on
which function was originally used. In some cases, `snprintf()` was used
to replace multiple uses of `strcat` because it was cleaner.
Whenever I changed a function, I preferred to use `sizeof(dst)` when the
compiler is able to provide the string size via that. When it could not
because the string was passed by a caller, I checked the entire call
tree of the function to find out how big the buffer was and hard coded
it. Hardcoding is less than ideal, but it is safe unless someone shrinks
the buffer sizes being passed.
Additionally, Coverity reported three more string related issues:
* It caught a case where we do an overlapping memory copy in a call to
`snprintf()`. We fix that via `kmem_strdup()` and `kmem_strfree()`.
* It caught `sizeof (buf)` being used instead of `buflen` in
`zdb_nicenum()`'s call to `zfs_nicenum()`, which is passed to
`snprintf()`. We change that to pass `buflen`.
* It caught a theoretical unterminated string passed to `strcmp()`.
This one is likely a false positive, but we have the information
needed to do this more safely, so we change this to silence the false
positive not just in coverity, but potentially other static analysis
tools too. We switch to `strncmp()`.
* There was a false positive in tests/zfs-tests/cmd/dir_rd_update.c. We
suppress it by switching to `snprintf()` since other static analysis
tools might complain about it too. Interestingly, there is a possible
real bug there too, since it assumes that the passed directory path
ends with '/'. We add a '/' to fix that potential bug.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13913
2022-09-28 02:47:24 +03:00
|
|
|
(void) strlcpy(path0, vd0->vdev_path, MAXPATHLEN);
|
|
|
|
(void) strlcpy(pathrand, vd0->vdev_path, MAXPATHLEN);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
leaf = 0;
|
|
|
|
leaves = 1;
|
|
|
|
maxfaults = INT_MAX; /* no limit on cache devices */
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* If we can tolerate two or more faults, or we're dealing
|
|
|
|
* with a slog, randomly online/offline vd0.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((maxfaults >= 2 || islog) && guid0 != 0) {
|
2009-01-16 00:59:39 +03:00
|
|
|
if (ztest_random(10) < 6) {
|
|
|
|
int flags = (ztest_random(2) == 0 ?
|
|
|
|
ZFS_OFFLINE_TEMPORARY : 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We have to grab the zs_name_lock as writer to
|
|
|
|
* prevent a race between offlining a slog and
|
|
|
|
* destroying a dataset. Offlining the slog will
|
|
|
|
* grab a reference on the dataset which may cause
|
2013-09-04 16:00:57 +04:00
|
|
|
* dsl_destroy_head() to fail with EBUSY thus
|
2010-05-29 00:45:14 +04:00
|
|
|
* leaving the dataset in an inconsistent state.
|
|
|
|
*/
|
|
|
|
if (islog)
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_wrlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3U(vdev_offline(spa, guid0, flags), !=, EBUSY);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (islog)
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2009-01-16 00:59:39 +03:00
|
|
|
} else {
|
2012-12-22 02:57:09 +04:00
|
|
|
/*
|
|
|
|
* Ideally we would like to be able to randomly
|
|
|
|
* call vdev_[on|off]line without holding locks
|
|
|
|
* to force unpredictable failures but the side
|
|
|
|
* effects of vdev_[on|off]line prevent us from
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
* doing so.
|
2012-12-22 02:57:09 +04:00
|
|
|
*/
|
2009-01-16 00:59:39 +03:00
|
|
|
(void) vdev_online(spa, guid0, 0, NULL);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (maxfaults == 0)
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* We have at least single-fault tolerance, so inject data corruption.
|
|
|
|
*/
|
|
|
|
fd = open(pathrand, O_RDWR);
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
if (fd == -1) /* we hit a gap in the device namespace */
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
fsize = lseek(fd, 0, SEEK_END);
|
|
|
|
|
|
|
|
while (--iters != 0) {
|
2016-01-23 04:06:14 +03:00
|
|
|
/*
|
|
|
|
* The offset must be chosen carefully to ensure that
|
|
|
|
* we do not inject a given logical block with errors
|
|
|
|
* on two different leaf devices, because ZFS can not
|
|
|
|
* tolerate that (if maxfaults==1).
|
|
|
|
*
|
2020-06-22 19:48:36 +03:00
|
|
|
* To achieve this we divide each leaf device into
|
|
|
|
* chunks of size (# leaves * SPA_MAXBLOCKSIZE * 4).
|
|
|
|
* Each chunk is further divided into error-injection
|
|
|
|
* ranges (can accept errors) and clear ranges (we do
|
|
|
|
* not inject errors in those). Each error-injection
|
|
|
|
* range can accept errors only for a single leaf vdev.
|
|
|
|
* Error-injection ranges are separated by clear ranges.
|
2016-01-23 04:06:14 +03:00
|
|
|
*
|
|
|
|
* For example, with 3 leaves, each chunk looks like:
|
|
|
|
* 0 to 32M: injection range for leaf 0
|
2020-06-22 19:48:36 +03:00
|
|
|
* 32M to 64M: clear range - no injection allowed
|
2016-01-23 04:06:14 +03:00
|
|
|
* 64M to 96M: injection range for leaf 1
|
2020-06-22 19:48:36 +03:00
|
|
|
* 96M to 128M: clear range - no injection allowed
|
2016-01-23 04:06:14 +03:00
|
|
|
* 128M to 160M: injection range for leaf 2
|
2020-06-22 19:48:36 +03:00
|
|
|
* 160M to 192M: clear range - no injection allowed
|
|
|
|
*
|
|
|
|
* Each clear range must be large enough such that a
|
|
|
|
* single block cannot straddle it. This way a block
|
|
|
|
* can't be a target in two different injection ranges
|
|
|
|
* (on different leaf vdevs).
|
2016-01-23 04:06:14 +03:00
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
offset = ztest_random(fsize / (leaves << bshift)) *
|
|
|
|
(leaves << bshift) + (leaf << bshift) +
|
|
|
|
(ztest_random(1ULL << (bshift - 1)) & -8ULL);
|
|
|
|
|
2017-01-26 23:34:29 +03:00
|
|
|
/*
|
|
|
|
* Only allow damage to the labels at one end of the vdev.
|
|
|
|
*
|
|
|
|
* If all labels are damaged, the device will be totally
|
|
|
|
* inaccessible, which will result in loss of data,
|
|
|
|
* because we also damage (parts of) the other side of
|
|
|
|
* the mirror/raidz.
|
|
|
|
*
|
|
|
|
* Additionally, we will always have both an even and an
|
|
|
|
* odd label, so that we can handle crashes in the
|
|
|
|
* middle of vdev_config_sync().
|
|
|
|
*/
|
|
|
|
if ((leaf & 1) == 0 && offset < VDEV_LABEL_START_SIZE)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The two end labels are stored at the "end" of the disk, but
|
|
|
|
* the end of the disk (vdev_psize) is aligned to
|
|
|
|
* sizeof (vdev_label_t).
|
|
|
|
*/
|
2024-05-10 18:47:21 +03:00
|
|
|
uint64_t psize = P2ALIGN_TYPED(fsize, sizeof (vdev_label_t),
|
|
|
|
uint64_t);
|
2017-01-26 23:34:29 +03:00
|
|
|
if ((leaf & 1) == 1 &&
|
|
|
|
offset + sizeof (bad) > psize - VDEV_LABEL_END_SIZE)
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (mirror_save != zs->zs_mirrors) {
|
|
|
|
(void) close(fd);
|
2010-08-26 22:13:05 +04:00
|
|
|
goto out;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (pwrite(fd, &bad, sizeof (bad), offset) != sizeof (bad))
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_TRUE,
|
|
|
|
"can't inject bad word at 0x%"PRIx64" in %s",
|
2008-11-20 23:01:55 +03:00
|
|
|
offset, pathrand);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 7)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("injected bad word into %s,"
|
2021-06-05 15:18:37 +03:00
|
|
|
" offset 0x%"PRIx64"\n", pathrand, offset);
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
|
|
|
|
injected = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
(void) close(fd);
|
2010-08-26 22:13:05 +04:00
|
|
|
out:
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
if (injected && ztest_opts.zo_raid_do_expand) {
|
|
|
|
int error = spa_scan(spa, POOL_SCAN_SCRUB);
|
|
|
|
if (error == 0) {
|
|
|
|
while (dsl_scan_scrubbing(spa_get_dsl(spa)))
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-08-26 22:13:05 +04:00
|
|
|
umem_free(path0, MAXPATHLEN);
|
|
|
|
umem_free(pathrand, MAXPATHLEN);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2019-01-18 20:47:55 +03:00
|
|
|
/*
|
|
|
|
* By design ztest will never inject uncorrectable damage in to the pool.
|
|
|
|
* Issue a scrub, wait for it to complete, and verify there is never any
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
* persistent damage.
|
2019-01-18 20:47:55 +03:00
|
|
|
*
|
|
|
|
* Only after a full scrub has been completed is it safe to start injecting
|
|
|
|
* data corruption. See the comment in zfs_fault_inject().
|
2024-10-09 06:41:17 +03:00
|
|
|
*
|
|
|
|
* EBUSY may be returned for the following six cases. It's the callers
|
|
|
|
* responsibility to handle them accordingly.
|
|
|
|
*
|
|
|
|
* Current state Requested
|
|
|
|
* 1. Normal Scrub Running Normal Scrub or Error Scrub
|
|
|
|
* 2. Normal Scrub Paused Error Scrub
|
|
|
|
* 3. Normal Scrub Paused Pause Normal Scrub
|
|
|
|
* 4. Error Scrub Running Normal Scrub or Error Scrub
|
|
|
|
* 5. Error Scrub Paused Pause Error Scrub
|
|
|
|
* 6. Resilvering Anything else
|
2019-01-18 20:47:55 +03:00
|
|
|
*/
|
|
|
|
static int
|
|
|
|
ztest_scrub_impl(spa_t *spa)
|
|
|
|
{
|
|
|
|
int error = spa_scan(spa, POOL_SCAN_SCRUB);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
while (dsl_scan_scrubbing(spa_get_dsl(spa)))
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
2022-12-22 22:48:49 +03:00
|
|
|
if (spa_approx_errlog_size(spa) > 0)
|
2019-01-18 20:47:55 +03:00
|
|
|
return (ECKSUM);
|
|
|
|
|
|
|
|
ztest_pool_scrubbed = B_TRUE;
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Scrub the pool.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
void
|
|
|
|
ztest_scrub(ztest_ds_t *zd, uint64_t id)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2019-01-18 20:47:55 +03:00
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
/*
|
|
|
|
* Scrub in progress by device removal.
|
|
|
|
*/
|
|
|
|
if (ztest_device_removal_active)
|
|
|
|
return;
|
|
|
|
|
2019-01-18 20:47:55 +03:00
|
|
|
/*
|
|
|
|
* Start a scrub, wait a moment, then force a restart.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) spa_scan(spa, POOL_SCAN_SCRUB);
|
2019-01-18 20:47:55 +03:00
|
|
|
(void) poll(NULL, 0, 100);
|
|
|
|
|
|
|
|
error = ztest_scrub_impl(spa);
|
|
|
|
if (error == EBUSY)
|
|
|
|
error = 0;
|
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2011-11-12 02:07:54 +04:00
|
|
|
/*
|
|
|
|
* Change the guid for the pool.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ztest_reguid(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2012-01-24 06:43:32 +04:00
|
|
|
spa_t *spa = ztest_spa;
|
2011-11-12 02:07:54 +04:00
|
|
|
uint64_t orig, load;
|
2012-12-15 00:38:04 +04:00
|
|
|
int error;
|
2023-09-23 02:08:51 +03:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2011-11-12 02:07:54 +04:00
|
|
|
|
2017-09-23 19:28:18 +03:00
|
|
|
if (ztest_opts.zo_mmp_test)
|
|
|
|
return;
|
|
|
|
|
2011-11-12 02:07:54 +04:00
|
|
|
orig = spa_guid(spa);
|
|
|
|
load = spa_load_guid(spa);
|
2012-12-15 00:38:04 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_wrlock(&ztest_name_lock);
|
2024-08-26 19:27:24 +03:00
|
|
|
error = spa_change_guid(spa, NULL);
|
2023-09-23 02:08:51 +03:00
|
|
|
zs->zs_guid = spa_guid(spa);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2012-12-15 00:38:04 +04:00
|
|
|
|
|
|
|
if (error != 0)
|
2011-11-12 02:07:54 +04:00
|
|
|
return;
|
|
|
|
|
2012-12-15 04:28:49 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) printf("Changed guid old %"PRIu64" -> %"PRIu64"\n",
|
|
|
|
orig, spa_guid(spa));
|
2011-11-12 02:07:54 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
VERIFY3U(orig, !=, spa_guid(spa));
|
|
|
|
VERIFY3U(load, ==, spa_load_guid(spa));
|
|
|
|
}
|
|
|
|
|
Introduce BLAKE3 checksums as an OpenZFS feature
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes #10058
Closes #12918
2022-06-09 01:55:57 +03:00
|
|
|
void
|
|
|
|
ztest_blake3(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
(void) zd, (void) id;
|
|
|
|
hrtime_t end = gethrtime() + NANOSEC;
|
|
|
|
zio_cksum_salt_t salt;
|
|
|
|
void *salt_ptr = &salt.zcs_bytes;
|
|
|
|
struct abd *abd_data, *abd_meta;
|
|
|
|
void *buf, *templ;
|
|
|
|
int i, *ptr;
|
|
|
|
uint32_t size;
|
|
|
|
BLAKE3_CTX ctx;
|
2023-02-27 18:14:37 +03:00
|
|
|
const zfs_impl_t *blake3 = zfs_impl_get_ops("blake3");
|
Introduce BLAKE3 checksums as an OpenZFS feature
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes #10058
Closes #12918
2022-06-09 01:55:57 +03:00
|
|
|
|
|
|
|
size = ztest_random_blocksize();
|
|
|
|
buf = umem_alloc(size, UMEM_NOFAIL);
|
|
|
|
abd_data = abd_alloc(size, B_FALSE);
|
|
|
|
abd_meta = abd_alloc(size, B_TRUE);
|
|
|
|
|
|
|
|
for (i = 0, ptr = buf; i < size / sizeof (*ptr); i++, ptr++)
|
|
|
|
*ptr = ztest_random(UINT_MAX);
|
|
|
|
memset(salt_ptr, 'A', 32);
|
|
|
|
|
|
|
|
abd_copy_from_buf_off(abd_data, buf, 0, size);
|
|
|
|
abd_copy_from_buf_off(abd_meta, buf, 0, size);
|
|
|
|
|
|
|
|
while (gethrtime() <= end) {
|
|
|
|
int run_count = 100;
|
|
|
|
zio_cksum_t zc_ref1, zc_ref2;
|
|
|
|
zio_cksum_t zc_res1, zc_res2;
|
|
|
|
|
|
|
|
void *ref1 = &zc_ref1;
|
|
|
|
void *ref2 = &zc_ref2;
|
|
|
|
void *res1 = &zc_res1;
|
|
|
|
void *res2 = &zc_res2;
|
|
|
|
|
|
|
|
/* BLAKE3_KEY_LEN = 32 */
|
2023-02-27 18:14:37 +03:00
|
|
|
VERIFY0(blake3->setname("generic"));
|
Introduce BLAKE3 checksums as an OpenZFS feature
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes #10058
Closes #12918
2022-06-09 01:55:57 +03:00
|
|
|
templ = abd_checksum_blake3_tmpl_init(&salt);
|
|
|
|
Blake3_InitKeyed(&ctx, salt_ptr);
|
|
|
|
Blake3_Update(&ctx, buf, size);
|
|
|
|
Blake3_Final(&ctx, ref1);
|
|
|
|
zc_ref2 = zc_ref1;
|
|
|
|
ZIO_CHECKSUM_BSWAP(&zc_ref2);
|
|
|
|
abd_checksum_blake3_tmpl_free(templ);
|
|
|
|
|
2023-02-27 18:14:37 +03:00
|
|
|
VERIFY0(blake3->setname("cycle"));
|
Introduce BLAKE3 checksums as an OpenZFS feature
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes #10058
Closes #12918
2022-06-09 01:55:57 +03:00
|
|
|
while (run_count-- > 0) {
|
|
|
|
|
|
|
|
/* Test current implementation */
|
|
|
|
Blake3_InitKeyed(&ctx, salt_ptr);
|
|
|
|
Blake3_Update(&ctx, buf, size);
|
|
|
|
Blake3_Final(&ctx, res1);
|
|
|
|
zc_res2 = zc_res1;
|
|
|
|
ZIO_CHECKSUM_BSWAP(&zc_res2);
|
|
|
|
|
|
|
|
VERIFY0(memcmp(ref1, res1, 32));
|
|
|
|
VERIFY0(memcmp(ref2, res2, 32));
|
|
|
|
|
|
|
|
/* Test ABD - data */
|
|
|
|
templ = abd_checksum_blake3_tmpl_init(&salt);
|
|
|
|
abd_checksum_blake3_native(abd_data, size,
|
|
|
|
templ, &zc_res1);
|
|
|
|
abd_checksum_blake3_byteswap(abd_data, size,
|
|
|
|
templ, &zc_res2);
|
|
|
|
|
|
|
|
VERIFY0(memcmp(ref1, res1, 32));
|
|
|
|
VERIFY0(memcmp(ref2, res2, 32));
|
|
|
|
|
|
|
|
/* Test ABD - metadata */
|
|
|
|
abd_checksum_blake3_native(abd_meta, size,
|
|
|
|
templ, &zc_res1);
|
|
|
|
abd_checksum_blake3_byteswap(abd_meta, size,
|
|
|
|
templ, &zc_res2);
|
|
|
|
abd_checksum_blake3_tmpl_free(templ);
|
|
|
|
|
|
|
|
VERIFY0(memcmp(ref1, res1, 32));
|
|
|
|
VERIFY0(memcmp(ref2, res2, 32));
|
|
|
|
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
abd_free(abd_data);
|
|
|
|
abd_free(abd_meta);
|
|
|
|
umem_free(buf, size);
|
|
|
|
}
|
|
|
|
|
2015-12-10 02:34:16 +03:00
|
|
|
void
|
|
|
|
ztest_fletcher(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2015-12-10 02:34:16 +03:00
|
|
|
hrtime_t end = gethrtime() + NANOSEC;
|
|
|
|
|
|
|
|
while (gethrtime() <= end) {
|
|
|
|
int run_count = 100;
|
|
|
|
void *buf;
|
2017-02-01 20:34:22 +03:00
|
|
|
struct abd *abd_data, *abd_meta;
|
2015-12-10 02:34:16 +03:00
|
|
|
uint32_t size;
|
|
|
|
int *ptr;
|
|
|
|
int i;
|
|
|
|
zio_cksum_t zc_ref;
|
|
|
|
zio_cksum_t zc_ref_byteswap;
|
|
|
|
|
|
|
|
size = ztest_random_blocksize();
|
2017-02-01 20:34:22 +03:00
|
|
|
|
2015-12-10 02:34:16 +03:00
|
|
|
buf = umem_alloc(size, UMEM_NOFAIL);
|
2017-02-01 20:34:22 +03:00
|
|
|
abd_data = abd_alloc(size, B_FALSE);
|
|
|
|
abd_meta = abd_alloc(size, B_TRUE);
|
2015-12-10 02:34:16 +03:00
|
|
|
|
|
|
|
for (i = 0, ptr = buf; i < size / sizeof (*ptr); i++, ptr++)
|
|
|
|
*ptr = ztest_random(UINT_MAX);
|
|
|
|
|
2017-02-01 20:34:22 +03:00
|
|
|
abd_copy_from_buf_off(abd_data, buf, 0, size);
|
|
|
|
abd_copy_from_buf_off(abd_meta, buf, 0, size);
|
|
|
|
|
2015-12-10 02:34:16 +03:00
|
|
|
VERIFY0(fletcher_4_impl_set("scalar"));
|
2016-06-16 01:47:05 +03:00
|
|
|
fletcher_4_native(buf, size, NULL, &zc_ref);
|
|
|
|
fletcher_4_byteswap(buf, size, NULL, &zc_ref_byteswap);
|
2015-12-10 02:34:16 +03:00
|
|
|
|
|
|
|
VERIFY0(fletcher_4_impl_set("cycle"));
|
|
|
|
while (run_count-- > 0) {
|
|
|
|
zio_cksum_t zc;
|
|
|
|
zio_cksum_t zc_byteswap;
|
|
|
|
|
2016-06-16 01:47:05 +03:00
|
|
|
fletcher_4_byteswap(buf, size, NULL, &zc_byteswap);
|
|
|
|
fletcher_4_native(buf, size, NULL, &zc);
|
2015-12-10 02:34:16 +03:00
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
VERIFY0(memcmp(&zc, &zc_ref, sizeof (zc)));
|
|
|
|
VERIFY0(memcmp(&zc_byteswap, &zc_ref_byteswap,
|
2015-12-10 02:34:16 +03:00
|
|
|
sizeof (zc_byteswap)));
|
2017-02-01 20:34:22 +03:00
|
|
|
|
|
|
|
/* Test ABD - data */
|
|
|
|
abd_fletcher_4_byteswap(abd_data, size, NULL,
|
|
|
|
&zc_byteswap);
|
|
|
|
abd_fletcher_4_native(abd_data, size, NULL, &zc);
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
VERIFY0(memcmp(&zc, &zc_ref, sizeof (zc)));
|
|
|
|
VERIFY0(memcmp(&zc_byteswap, &zc_ref_byteswap,
|
2017-02-01 20:34:22 +03:00
|
|
|
sizeof (zc_byteswap)));
|
|
|
|
|
|
|
|
/* Test ABD - metadata */
|
|
|
|
abd_fletcher_4_byteswap(abd_meta, size, NULL,
|
|
|
|
&zc_byteswap);
|
|
|
|
abd_fletcher_4_native(abd_meta, size, NULL, &zc);
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
VERIFY0(memcmp(&zc, &zc_ref, sizeof (zc)));
|
|
|
|
VERIFY0(memcmp(&zc_byteswap, &zc_ref_byteswap,
|
2017-02-01 20:34:22 +03:00
|
|
|
sizeof (zc_byteswap)));
|
|
|
|
|
2015-12-10 02:34:16 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
umem_free(buf, size);
|
2017-02-01 20:34:22 +03:00
|
|
|
abd_free(abd_data);
|
|
|
|
abd_free(abd_meta);
|
2015-12-10 02:34:16 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-09-23 04:52:29 +03:00
|
|
|
void
|
|
|
|
ztest_fletcher_incr(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2016-09-23 04:52:29 +03:00
|
|
|
void *buf;
|
|
|
|
size_t size;
|
|
|
|
int *ptr;
|
|
|
|
int i;
|
|
|
|
zio_cksum_t zc_ref;
|
|
|
|
zio_cksum_t zc_ref_bswap;
|
|
|
|
|
|
|
|
hrtime_t end = gethrtime() + NANOSEC;
|
|
|
|
|
|
|
|
while (gethrtime() <= end) {
|
|
|
|
int run_count = 100;
|
|
|
|
|
|
|
|
size = ztest_random_blocksize();
|
|
|
|
buf = umem_alloc(size, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
for (i = 0, ptr = buf; i < size / sizeof (*ptr); i++, ptr++)
|
|
|
|
*ptr = ztest_random(UINT_MAX);
|
|
|
|
|
|
|
|
VERIFY0(fletcher_4_impl_set("scalar"));
|
|
|
|
fletcher_4_native(buf, size, NULL, &zc_ref);
|
|
|
|
fletcher_4_byteswap(buf, size, NULL, &zc_ref_bswap);
|
|
|
|
|
|
|
|
VERIFY0(fletcher_4_impl_set("cycle"));
|
|
|
|
|
|
|
|
while (run_count-- > 0) {
|
|
|
|
zio_cksum_t zc;
|
|
|
|
zio_cksum_t zc_bswap;
|
|
|
|
size_t pos = 0;
|
|
|
|
|
|
|
|
ZIO_SET_CHECKSUM(&zc, 0, 0, 0, 0);
|
|
|
|
ZIO_SET_CHECKSUM(&zc_bswap, 0, 0, 0, 0);
|
|
|
|
|
|
|
|
while (pos < size) {
|
|
|
|
size_t inc = 64 * ztest_random(size / 67);
|
|
|
|
/* sometimes add few bytes to test non-simd */
|
|
|
|
if (ztest_random(100) < 10)
|
2024-05-10 18:47:21 +03:00
|
|
|
inc += P2ALIGN_TYPED(ztest_random(64),
|
|
|
|
sizeof (uint32_t), uint64_t);
|
2016-09-23 04:52:29 +03:00
|
|
|
|
|
|
|
if (inc > (size - pos))
|
|
|
|
inc = size - pos;
|
|
|
|
|
|
|
|
fletcher_4_incremental_native(buf + pos, inc,
|
|
|
|
&zc);
|
|
|
|
fletcher_4_incremental_byteswap(buf + pos, inc,
|
|
|
|
&zc_bswap);
|
|
|
|
|
|
|
|
pos += inc;
|
|
|
|
}
|
|
|
|
|
|
|
|
VERIFY3U(pos, ==, size);
|
|
|
|
|
|
|
|
VERIFY(ZIO_CHECKSUM_EQUAL(zc, zc_ref));
|
|
|
|
VERIFY(ZIO_CHECKSUM_EQUAL(zc_bswap, zc_ref_bswap));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* verify if incremental on the whole buffer is
|
|
|
|
* equivalent to non-incremental version
|
|
|
|
*/
|
|
|
|
ZIO_SET_CHECKSUM(&zc, 0, 0, 0, 0);
|
|
|
|
ZIO_SET_CHECKSUM(&zc_bswap, 0, 0, 0, 0);
|
|
|
|
|
|
|
|
fletcher_4_incremental_native(buf, size, &zc);
|
|
|
|
fletcher_4_incremental_byteswap(buf, size, &zc_bswap);
|
|
|
|
|
|
|
|
VERIFY(ZIO_CHECKSUM_EQUAL(zc, zc_ref));
|
|
|
|
VERIFY(ZIO_CHECKSUM_EQUAL(zc_bswap, zc_ref_bswap));
|
|
|
|
}
|
|
|
|
|
|
|
|
umem_free(buf, size);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2024-07-26 19:16:18 +03:00
|
|
|
void
|
|
|
|
ztest_pool_prefetch_ddt(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
(void) zd, (void) id;
|
|
|
|
spa_t *spa;
|
|
|
|
|
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
|
|
|
|
|
|
|
ddt_prefetch_all(spa);
|
|
|
|
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
|
|
|
}
|
|
|
|
|
2021-02-18 14:20:09 +03:00
|
|
|
static int
|
|
|
|
ztest_set_global_vars(void)
|
|
|
|
{
|
|
|
|
for (size_t i = 0; i < ztest_opts.zo_gvars_count; i++) {
|
|
|
|
char *kv = ztest_opts.zo_gvars[i];
|
|
|
|
VERIFY3U(strlen(kv), <=, ZO_GVARS_MAX_ARGLEN);
|
|
|
|
VERIFY3U(strlen(kv), >, 0);
|
|
|
|
int err = set_global_var(kv);
|
|
|
|
if (ztest_opts.zo_verbose > 0) {
|
|
|
|
(void) printf("setting global var %s ... %s\n", kv,
|
|
|
|
err ? "failed" : "ok");
|
|
|
|
}
|
|
|
|
if (err != 0) {
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"failed to set global var '%s'\n", kv);
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static char **
|
|
|
|
ztest_global_vars_to_zdb_args(void)
|
|
|
|
{
|
|
|
|
char **args = calloc(2*ztest_opts.zo_gvars_count + 1, sizeof (char *));
|
|
|
|
char **cur = args;
|
2022-10-01 03:02:57 +03:00
|
|
|
if (args == NULL)
|
|
|
|
return (NULL);
|
2021-02-18 14:20:09 +03:00
|
|
|
for (size_t i = 0; i < ztest_opts.zo_gvars_count; i++) {
|
2022-04-19 21:38:30 +03:00
|
|
|
*cur++ = (char *)"-o";
|
|
|
|
*cur++ = ztest_opts.zo_gvars[i];
|
2021-02-18 14:20:09 +03:00
|
|
|
}
|
|
|
|
ASSERT3P(cur, ==, &args[2*ztest_opts.zo_gvars_count]);
|
|
|
|
*cur = NULL;
|
|
|
|
return (args);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The end of strings is indicated by a NULL element */
|
|
|
|
static char *
|
|
|
|
join_strings(char **strings, const char *sep)
|
|
|
|
{
|
|
|
|
size_t totallen = 0;
|
|
|
|
for (char **sp = strings; *sp != NULL; sp++) {
|
|
|
|
totallen += strlen(*sp);
|
|
|
|
totallen += strlen(sep);
|
|
|
|
}
|
|
|
|
if (totallen > 0) {
|
|
|
|
ASSERT(totallen >= strlen(sep));
|
|
|
|
totallen -= strlen(sep);
|
|
|
|
}
|
|
|
|
|
|
|
|
size_t buflen = totallen + 1;
|
Handle possible null pointers from malloc/strdup/strndup()
GCC 12.1.1_p20220625's static analyzer caught these.
Of the two in the btree test, one had previously been caught by Coverity
and Smatch, but GCC flagged it as a false positive. Upon examining how
other test cases handle this, the solution was changed from
`ASSERT3P(node, !=, NULL);` to using `perror()` to be consistent with
the fixes to the other fixes done to the ZTS code.
That approach was also used in ZED since I did not see a better way of
handling this there. Also, upon inspection, additional unchecked
pointers from malloc()/calloc()/strdup() were found in ZED, so those
were handled too.
In other parts of the code, the existing methods to avoid issues from
memory allocators returning NULL were used, such as using
`umem_alloc(size, UMEM_NOFAIL)` or returning `ENOMEM`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13979
2022-10-07 03:18:40 +03:00
|
|
|
char *o = umem_alloc(buflen, UMEM_NOFAIL); /* trailing 0 byte */
|
2021-02-18 14:20:09 +03:00
|
|
|
o[0] = '\0';
|
|
|
|
for (char **sp = strings; *sp != NULL; sp++) {
|
|
|
|
size_t would;
|
|
|
|
would = strlcat(o, *sp, buflen);
|
|
|
|
VERIFY3U(would, <, buflen);
|
|
|
|
if (*(sp+1) == NULL) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
would = strlcat(o, sep, buflen);
|
|
|
|
VERIFY3U(would, <, buflen);
|
|
|
|
}
|
|
|
|
ASSERT3S(strlen(o), ==, totallen);
|
|
|
|
return (o);
|
|
|
|
}
|
|
|
|
|
2015-11-21 02:50:06 +03:00
|
|
|
static int
|
|
|
|
ztest_check_path(char *path)
|
|
|
|
{
|
|
|
|
struct stat s;
|
|
|
|
/* return true on success */
|
|
|
|
return (!stat(path, &s));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_get_zdb_bin(char *bin, int len)
|
|
|
|
{
|
|
|
|
char *zdb_path;
|
|
|
|
/*
|
2022-05-03 13:15:46 +03:00
|
|
|
* Try to use $ZDB and in-tree zdb path. If not successful, just
|
2015-11-21 02:50:06 +03:00
|
|
|
* let popen to search through PATH.
|
|
|
|
*/
|
2022-05-03 13:15:46 +03:00
|
|
|
if ((zdb_path = getenv("ZDB"))) {
|
2015-11-21 02:50:06 +03:00
|
|
|
strlcpy(bin, zdb_path, len); /* In env */
|
|
|
|
if (!ztest_check_path(bin)) {
|
|
|
|
ztest_dump_core = 0;
|
2022-05-03 13:15:46 +03:00
|
|
|
fatal(B_TRUE, "invalid ZDB '%s'", bin);
|
2015-11-21 02:50:06 +03:00
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3P(realpath(getexecname(), bin), !=, NULL);
|
2022-05-03 13:22:30 +03:00
|
|
|
if (strstr(bin, ".libs/ztest")) {
|
|
|
|
strstr(bin, ".libs/ztest")[0] = '\0'; /* In-tree */
|
|
|
|
strcat(bin, "zdb");
|
2015-11-21 02:50:06 +03:00
|
|
|
if (ztest_check_path(bin))
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
strcpy(bin, "zdb");
|
|
|
|
}
|
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
static vdev_t *
|
|
|
|
ztest_random_concrete_vdev_leaf(vdev_t *vd)
|
|
|
|
{
|
|
|
|
if (vd == NULL)
|
|
|
|
return (NULL);
|
|
|
|
|
|
|
|
if (vd->vdev_children == 0)
|
|
|
|
return (vd);
|
|
|
|
|
|
|
|
vdev_t *eligible[vd->vdev_children];
|
|
|
|
int eligible_idx = 0, i;
|
|
|
|
for (i = 0; i < vd->vdev_children; i++) {
|
|
|
|
vdev_t *cvd = vd->vdev_child[i];
|
|
|
|
if (cvd->vdev_top->vdev_removing)
|
|
|
|
continue;
|
|
|
|
if (cvd->vdev_children > 0 ||
|
|
|
|
(vdev_is_concrete(cvd) && !cvd->vdev_detached)) {
|
|
|
|
eligible[eligible_idx++] = cvd;
|
|
|
|
}
|
|
|
|
}
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3S(eligible_idx, >, 0);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
|
|
|
|
uint64_t child_no = ztest_random(eligible_idx);
|
|
|
|
return (ztest_random_concrete_vdev_leaf(eligible[child_no]));
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
ztest_initialize(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
|
|
|
/* Random leaf vdev */
|
|
|
|
vdev_t *rand_vd = ztest_random_concrete_vdev_leaf(spa->spa_root_vdev);
|
|
|
|
if (rand_vd == NULL) {
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The random vdev we've selected may change as soon as we
|
|
|
|
* drop the spa_config_lock. We create local copies of things
|
|
|
|
* we're interested in.
|
|
|
|
*/
|
|
|
|
uint64_t guid = rand_vd->vdev_guid;
|
|
|
|
char *path = strdup(rand_vd->vdev_path);
|
|
|
|
boolean_t active = rand_vd->vdev_initialize_thread != NULL;
|
|
|
|
|
2021-06-23 07:53:45 +03:00
|
|
|
zfs_dbgmsg("vd %px, guid %llu", rand_vd, (u_longlong_t)guid);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
uint64_t cmd = ztest_random(POOL_INITIALIZE_FUNCS);
|
2018-12-19 19:20:39 +03:00
|
|
|
|
|
|
|
nvlist_t *vdev_guids = fnvlist_alloc();
|
|
|
|
nvlist_t *vdev_errlist = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64(vdev_guids, path, guid);
|
|
|
|
error = spa_vdev_initialize(spa, vdev_guids, cmd, vdev_errlist);
|
|
|
|
fnvlist_free(vdev_guids);
|
|
|
|
fnvlist_free(vdev_errlist);
|
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
switch (cmd) {
|
|
|
|
case POOL_INITIALIZE_CANCEL:
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Cancel initialize %s", path);
|
|
|
|
if (!active)
|
|
|
|
(void) printf(" failed (no initialize active)");
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
2019-03-29 19:13:20 +03:00
|
|
|
case POOL_INITIALIZE_START:
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Start initialize %s", path);
|
|
|
|
if (active && error == 0)
|
|
|
|
(void) printf(" failed (already active)");
|
|
|
|
else if (error != 0)
|
|
|
|
(void) printf(" failed (error %d)", error);
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case POOL_INITIALIZE_SUSPEND:
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Suspend initialize %s", path);
|
|
|
|
if (!active)
|
|
|
|
(void) printf(" failed (no initialize active)");
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
free(path);
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
}
|
|
|
|
|
2019-03-29 19:13:20 +03:00
|
|
|
void
|
|
|
|
ztest_trim(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) zd, (void) id;
|
2019-03-29 19:13:20 +03:00
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
mutex_enter(&ztest_vdev_lock);
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
|
|
|
/* Random leaf vdev */
|
|
|
|
vdev_t *rand_vd = ztest_random_concrete_vdev_leaf(spa->spa_root_vdev);
|
|
|
|
if (rand_vd == NULL) {
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The random vdev we've selected may change as soon as we
|
|
|
|
* drop the spa_config_lock. We create local copies of things
|
|
|
|
* we're interested in.
|
|
|
|
*/
|
|
|
|
uint64_t guid = rand_vd->vdev_guid;
|
|
|
|
char *path = strdup(rand_vd->vdev_path);
|
|
|
|
boolean_t active = rand_vd->vdev_trim_thread != NULL;
|
|
|
|
|
2021-06-23 07:53:45 +03:00
|
|
|
zfs_dbgmsg("vd %p, guid %llu", rand_vd, (u_longlong_t)guid);
|
2019-03-29 19:13:20 +03:00
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
uint64_t cmd = ztest_random(POOL_TRIM_FUNCS);
|
|
|
|
uint64_t rate = 1 << ztest_random(30);
|
|
|
|
boolean_t partial = (ztest_random(5) > 0);
|
|
|
|
boolean_t secure = (ztest_random(5) > 0);
|
|
|
|
|
|
|
|
nvlist_t *vdev_guids = fnvlist_alloc();
|
|
|
|
nvlist_t *vdev_errlist = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64(vdev_guids, path, guid);
|
|
|
|
error = spa_vdev_trim(spa, vdev_guids, cmd, rate, partial,
|
|
|
|
secure, vdev_errlist);
|
|
|
|
fnvlist_free(vdev_guids);
|
|
|
|
fnvlist_free(vdev_errlist);
|
|
|
|
|
|
|
|
switch (cmd) {
|
|
|
|
case POOL_TRIM_CANCEL:
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Cancel TRIM %s", path);
|
|
|
|
if (!active)
|
|
|
|
(void) printf(" failed (no TRIM active)");
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case POOL_TRIM_START:
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Start TRIM %s", path);
|
|
|
|
if (active && error == 0)
|
|
|
|
(void) printf(" failed (already active)");
|
|
|
|
else if (error != 0)
|
|
|
|
(void) printf(" failed (error %d)", error);
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case POOL_TRIM_SUSPEND:
|
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
|
|
|
(void) printf("Suspend TRIM %s", path);
|
|
|
|
if (!active)
|
|
|
|
(void) printf(" failed (no TRIM active)");
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
free(path);
|
|
|
|
mutex_exit(&ztest_vdev_lock);
|
|
|
|
}
|
|
|
|
|
2024-06-18 01:35:18 +03:00
|
|
|
void
|
|
|
|
ztest_ddt_prune(ztest_ds_t *zd, uint64_t id)
|
|
|
|
{
|
|
|
|
(void) zd, (void) id;
|
|
|
|
|
|
|
|
spa_t *spa = ztest_spa;
|
|
|
|
uint64_t pct = ztest_random(15) + 1;
|
|
|
|
|
|
|
|
(void) ddt_prune_unique_entries(spa, ZPOOL_DDT_PRUNE_PERCENTAGE, pct);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Verify pool integrity by running zdb.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2023-09-23 02:08:51 +03:00
|
|
|
ztest_run_zdb(uint64_t guid)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int status;
|
|
|
|
char *bin;
|
2010-08-26 22:59:11 +04:00
|
|
|
char *zdb;
|
|
|
|
char *zbuf;
|
2015-11-21 02:50:06 +03:00
|
|
|
const int len = MAXPATHLEN + MAXNAMELEN + 20;
|
2008-11-20 23:01:55 +03:00
|
|
|
FILE *fp;
|
|
|
|
|
2015-11-21 02:50:06 +03:00
|
|
|
bin = umem_alloc(len, UMEM_NOFAIL);
|
|
|
|
zdb = umem_alloc(len, UMEM_NOFAIL);
|
2010-08-26 22:59:11 +04:00
|
|
|
zbuf = umem_alloc(1024, UMEM_NOFAIL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-11-21 02:50:06 +03:00
|
|
|
ztest_get_zdb_bin(bin, len);
|
2010-08-26 22:59:11 +04:00
|
|
|
|
2021-02-18 14:20:09 +03:00
|
|
|
char **set_gvars_args = ztest_global_vars_to_zdb_args();
|
2022-10-01 03:02:57 +03:00
|
|
|
if (set_gvars_args == NULL) {
|
|
|
|
fatal(B_FALSE, "Failed to allocate memory in "
|
|
|
|
"ztest_global_vars_to_zdb_args(). Cannot run zdb.\n");
|
|
|
|
}
|
2021-02-18 14:20:09 +03:00
|
|
|
char *set_gvars_args_joined = join_strings(set_gvars_args, " ");
|
|
|
|
free(set_gvars_args);
|
|
|
|
|
|
|
|
size_t would = snprintf(zdb, len,
|
2023-09-23 02:08:51 +03:00
|
|
|
"%s -bcc%s%s -G -d -Y -e -y %s -p %s %"PRIu64,
|
2010-08-26 22:59:11 +04:00
|
|
|
bin,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_verbose >= 3 ? "s" : "",
|
|
|
|
ztest_opts.zo_verbose >= 4 ? "v" : "",
|
2021-02-18 14:20:09 +03:00
|
|
|
set_gvars_args_joined,
|
2020-05-30 07:14:10 +03:00
|
|
|
ztest_opts.zo_dir,
|
2023-09-23 02:08:51 +03:00
|
|
|
guid);
|
2021-02-18 14:20:09 +03:00
|
|
|
ASSERT3U(would, <, len);
|
|
|
|
|
Handle possible null pointers from malloc/strdup/strndup()
GCC 12.1.1_p20220625's static analyzer caught these.
Of the two in the btree test, one had previously been caught by Coverity
and Smatch, but GCC flagged it as a false positive. Upon examining how
other test cases handle this, the solution was changed from
`ASSERT3P(node, !=, NULL);` to using `perror()` to be consistent with
the fixes to the other fixes done to the ZTS code.
That approach was also used in ZED since I did not see a better way of
handling this there. Also, upon inspection, additional unchecked
pointers from malloc()/calloc()/strdup() were found in ZED, so those
were handled too.
In other parts of the code, the existing methods to avoid issues from
memory allocators returning NULL were used, such as using
`umem_alloc(size, UMEM_NOFAIL)` or returning `ENOMEM`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13979
2022-10-07 03:18:40 +03:00
|
|
|
umem_free(set_gvars_args_joined, strlen(set_gvars_args_joined) + 1);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 5)
|
2022-05-03 13:15:46 +03:00
|
|
|
(void) printf("Executing %s\n", zdb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
fp = popen(zdb, "r");
|
|
|
|
|
Fix ztest vdev file paths.
Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:
snprintf(path, sizeof (path), ztest_dev_template, ...);
This worked fine until 40b84e7aec6392187722e61e5a4a853b530bf60f, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.
As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".
This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
2012-10-01 18:34:52 +04:00
|
|
|
while (fgets(zbuf, 1024, fp) != NULL)
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("%s", zbuf);
|
|
|
|
|
|
|
|
status = pclose(fp);
|
|
|
|
|
|
|
|
if (status == 0)
|
2010-08-26 22:59:11 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ztest_dump_core = 0;
|
|
|
|
if (WIFEXITED(status))
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "'%s' exit code %d", zdb, WEXITSTATUS(status));
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "'%s' died with signal %d",
|
|
|
|
zdb, WTERMSIG(status));
|
2010-08-26 22:59:11 +04:00
|
|
|
out:
|
2015-11-21 02:50:06 +03:00
|
|
|
umem_free(bin, len);
|
|
|
|
umem_free(zdb, len);
|
2010-08-26 22:59:11 +04:00
|
|
|
umem_free(zbuf, 1024);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2022-04-19 21:38:30 +03:00
|
|
|
ztest_walk_pool_directory(const char *header)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
spa_t *spa = NULL;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6)
|
2022-04-19 21:38:30 +03:00
|
|
|
(void) puts(header);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
while ((spa = spa_next(spa)) != NULL)
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6)
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\t%s\n", spa_name(spa));
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_spa_import_export(char *oldname, char *newname)
|
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
nvlist_t *config, *newconfig;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t pool_guid;
|
|
|
|
spa_t *spa;
|
2013-09-04 16:00:57 +04:00
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 4) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("import/export: old = %s, new = %s\n",
|
|
|
|
oldname, newname);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clean up from previous runs.
|
|
|
|
*/
|
|
|
|
(void) spa_destroy(newname);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get the pool's configuration and guid.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(oldname, &spa, FTAG));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
/*
|
|
|
|
* Kick off a scrub to tickle scrub/export races.
|
|
|
|
*/
|
|
|
|
if (ztest_random(2) == 0)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) spa_scan(spa, POOL_SCAN_SCRUB);
|
2009-01-16 00:59:39 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
pool_guid = spa_guid(spa);
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
ztest_walk_pool_directory("pools before export");
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Export it.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_export(oldname, &config, B_FALSE, B_FALSE));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ztest_walk_pool_directory("pools after export");
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
/*
|
|
|
|
* Try to import it.
|
|
|
|
*/
|
|
|
|
newconfig = spa_tryimport(config);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(newconfig, !=, NULL);
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(newconfig);
|
2009-01-16 00:59:39 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Import it under the new name.
|
|
|
|
*/
|
2013-09-04 16:00:57 +04:00
|
|
|
error = spa_import(newname, config, NULL, 0);
|
|
|
|
if (error != 0) {
|
|
|
|
dump_nvlist(config, 0);
|
|
|
|
fatal(B_FALSE, "couldn't import pool %s as %s: error %u",
|
|
|
|
oldname, newname, error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ztest_walk_pool_directory("pools after import");
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to import it again -- should fail with EEXIST.
|
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
VERIFY3U(EEXIST, ==, spa_import(newname, config, NULL, 0));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to import it under a different name -- should fail with EEXIST.
|
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
VERIFY3U(EEXIST, ==, spa_import(oldname, config, NULL, 0));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that the pool is no longer visible under the old name.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(ENOENT, ==, spa_open(oldname, &spa, FTAG));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can open and close the pool using the new name.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(newname, &spa, FTAG));
|
|
|
|
ASSERT3U(pool_guid, ==, spa_guid(spa));
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(config);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
static void
|
|
|
|
ztest_resume(spa_t *spa)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
if (spa_suspended(spa) && ztest_opts.zo_verbose >= 6)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("resuming from suspended state\n");
|
|
|
|
spa_vdev_state_enter(spa, SCL_NONE);
|
|
|
|
vdev_clear(spa, NULL);
|
|
|
|
(void) spa_vdev_state_exit(spa, NULL, 0);
|
|
|
|
(void) zio_resume(spa);
|
2009-01-16 00:59:39 +03:00
|
|
|
}
|
|
|
|
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((noreturn)) void
|
2009-01-16 00:59:39 +03:00
|
|
|
ztest_resume_thread(void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_t *spa = arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2024-06-18 01:35:18 +03:00
|
|
|
/*
|
|
|
|
* Synthesize aged DDT entries for ddt prune testing
|
|
|
|
*/
|
|
|
|
ddt_prune_artificial_age = B_TRUE;
|
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
|
|
|
ddt_dump_prune_histogram = B_TRUE;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
while (!ztest_exiting) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (spa_suspended(spa))
|
|
|
|
ztest_resume(spa);
|
|
|
|
(void) poll(NULL, 0, 100);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Periodically change the zfs_compressed_arc_enabled setting.
|
|
|
|
*/
|
|
|
|
if (ztest_random(10) == 0)
|
|
|
|
zfs_compressed_arc_enabled = ztest_random(2);
|
2016-07-22 18:52:49 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Periodically change the zfs_abd_scatter_enabled setting.
|
|
|
|
*/
|
|
|
|
if (ztest_random(10) == 0)
|
|
|
|
zfs_abd_scatter_enabled = ztest_random(2);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
thread_exit();
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((noreturn)) void
|
2017-12-19 01:06:07 +03:00
|
|
|
ztest_deadman_thread(void *arg)
|
2010-08-26 21:43:27 +04:00
|
|
|
{
|
2017-12-19 01:06:07 +03:00
|
|
|
ztest_shared_t *zs = arg;
|
|
|
|
spa_t *spa = ztest_spa;
|
2018-10-10 23:48:33 +03:00
|
|
|
hrtime_t delay, overdue, last_run = gethrtime();
|
2017-12-19 01:06:07 +03:00
|
|
|
|
2018-10-10 23:48:33 +03:00
|
|
|
delay = (zs->zs_thread_stop - zs->zs_thread_start) +
|
|
|
|
MSEC2NSEC(zfs_deadman_synctime_ms);
|
2017-12-19 01:06:07 +03:00
|
|
|
|
2018-10-10 23:48:33 +03:00
|
|
|
while (!ztest_exiting) {
|
|
|
|
/*
|
|
|
|
* Wait for the delay timer while checking occasionally
|
|
|
|
* if we should stop.
|
|
|
|
*/
|
|
|
|
if (gethrtime() < last_run + delay) {
|
|
|
|
(void) poll(NULL, 0, 1000);
|
|
|
|
continue;
|
|
|
|
}
|
2017-12-19 01:06:07 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the pool is suspended then fail immediately. Otherwise,
|
|
|
|
* check to see if the pool is making any progress. If
|
|
|
|
* vdev_deadman() discovers that there hasn't been any recent
|
|
|
|
* I/Os then it will end up aborting the tests.
|
|
|
|
*/
|
|
|
|
if (spa_suspended(spa) || spa->spa_root_vdev == NULL) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE,
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
"aborting test after %llu seconds because "
|
2017-12-19 01:06:07 +03:00
|
|
|
"pool has transitioned to a suspended state.",
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
(u_longlong_t)zfs_deadman_synctime_ms / 1000);
|
2017-12-19 01:06:07 +03:00
|
|
|
}
|
|
|
|
vdev_deadman(spa->spa_root_vdev, FTAG);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the process doesn't complete within a grace period of
|
|
|
|
* zfs_deadman_synctime_ms over the expected finish time,
|
|
|
|
* then it may be hung and is terminated.
|
|
|
|
*/
|
|
|
|
overdue = zs->zs_proc_stop + MSEC2NSEC(zfs_deadman_synctime_ms);
|
|
|
|
if (gethrtime() > overdue) {
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE,
|
|
|
|
"aborting test after %llu seconds because "
|
2018-10-10 23:48:33 +03:00
|
|
|
"the process is overdue for termination.",
|
|
|
|
(gethrtime() - zs->zs_proc_start) / NANOSEC);
|
2017-12-19 01:06:07 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("ztest has been running for %lld seconds\n",
|
2018-10-10 23:48:33 +03:00
|
|
|
(gethrtime() - zs->zs_proc_start) / NANOSEC);
|
|
|
|
|
|
|
|
last_run = gethrtime();
|
|
|
|
delay = MSEC2NSEC(zfs_deadman_checktime_ms);
|
2017-12-19 01:06:07 +03:00
|
|
|
}
|
2018-10-10 23:48:33 +03:00
|
|
|
|
|
|
|
thread_exit();
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_execute(int test, ztest_info_t *zi, uint64_t id)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_ds_t *zd = &ztest_ds[id % ztest_opts.zo_datasets];
|
|
|
|
ztest_shared_callstate_t *zc = ZTEST_GET_SHARED_CALLSTATE(test);
|
2010-05-29 00:45:14 +04:00
|
|
|
hrtime_t functime = gethrtime();
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < zi->zi_iters; i++)
|
2010-05-29 00:45:14 +04:00
|
|
|
zi->zi_func(zd, id);
|
|
|
|
|
|
|
|
functime = gethrtime() - functime;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
atomic_add_64(&zc->zc_count, 1);
|
|
|
|
atomic_add_64(&zc->zc_time, functime);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-02-24 21:53:31 +03:00
|
|
|
if (ztest_opts.zo_verbose >= 4)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("%6.2f sec in %s\n",
|
2015-02-24 21:53:31 +03:00
|
|
|
(double)functime / NANOSEC, zi->zi_funcname);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
typedef struct ztest_raidz_expand_io {
|
|
|
|
uint64_t rzx_id;
|
|
|
|
uint64_t rzx_amount;
|
|
|
|
uint64_t rzx_bufsize;
|
|
|
|
const void *rzx_buffer;
|
|
|
|
uint64_t rzx_alloc_max;
|
|
|
|
spa_t *rzx_spa;
|
|
|
|
} ztest_expand_io_t;
|
|
|
|
|
|
|
|
#undef OD_ARRAY_SIZE
|
|
|
|
#define OD_ARRAY_SIZE 10
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Write a request amount of data to some dataset objects.
|
|
|
|
* There will be ztest_opts.zo_threads count of these running in parallel.
|
|
|
|
*/
|
|
|
|
static __attribute__((noreturn)) void
|
|
|
|
ztest_rzx_thread(void *arg)
|
|
|
|
{
|
|
|
|
ztest_expand_io_t *info = (ztest_expand_io_t *)arg;
|
|
|
|
ztest_od_t *od;
|
|
|
|
int batchsize;
|
|
|
|
int od_size;
|
|
|
|
ztest_ds_t *zd = &ztest_ds[info->rzx_id % ztest_opts.zo_datasets];
|
|
|
|
spa_t *spa = info->rzx_spa;
|
|
|
|
|
|
|
|
od_size = sizeof (ztest_od_t) * OD_ARRAY_SIZE;
|
|
|
|
od = umem_alloc(od_size, UMEM_NOFAIL);
|
|
|
|
batchsize = OD_ARRAY_SIZE;
|
|
|
|
|
|
|
|
/* Create objects to write to */
|
|
|
|
for (int b = 0; b < batchsize; b++) {
|
|
|
|
ztest_od_init(od + b, info->rzx_id, FTAG, b,
|
|
|
|
DMU_OT_UINT64_OTHER, 0, 0, 0);
|
|
|
|
}
|
|
|
|
if (ztest_object_init(zd, od, od_size, B_FALSE) != 0) {
|
|
|
|
umem_free(od, od_size);
|
|
|
|
thread_exit();
|
|
|
|
}
|
|
|
|
|
|
|
|
for (uint64_t offset = 0, written = 0; written < info->rzx_amount;
|
|
|
|
offset += info->rzx_bufsize) {
|
|
|
|
/* write to 10 objects */
|
|
|
|
for (int i = 0; i < batchsize && written < info->rzx_amount;
|
|
|
|
i++) {
|
|
|
|
(void) pthread_rwlock_rdlock(&zd->zd_zilog_lock);
|
|
|
|
ztest_write(zd, od[i].od_object, offset,
|
|
|
|
info->rzx_bufsize, info->rzx_buffer);
|
|
|
|
(void) pthread_rwlock_unlock(&zd->zd_zilog_lock);
|
|
|
|
written += info->rzx_bufsize;
|
|
|
|
}
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
/* due to inflation, we'll typically bail here */
|
|
|
|
if (metaslab_class_get_alloc(spa_normal_class(spa)) >
|
|
|
|
info->rzx_alloc_max) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Remove a few objects to leave some holes in allocation space */
|
|
|
|
mutex_enter(&zd->zd_dirobj_lock);
|
|
|
|
(void) ztest_remove(zd, od, 2);
|
|
|
|
mutex_exit(&zd->zd_dirobj_lock);
|
|
|
|
|
|
|
|
umem_free(od, od_size);
|
|
|
|
|
|
|
|
thread_exit();
|
|
|
|
}
|
|
|
|
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((noreturn)) void
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_thread(void *arg)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
int rand;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t id = (uintptr_t)arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t call_next;
|
|
|
|
hrtime_t now;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_info_t *zi;
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_callstate_t *zc;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
while ((now = gethrtime()) < zs->zs_thread_stop) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* See if it's time to force a crash.
|
|
|
|
*/
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
if (now > zs->zs_thread_kill &&
|
|
|
|
raidz_expand_pause_point == RAIDZ_EXPAND_PAUSE_NONE) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_kill(zs);
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* If we're getting ENOSPC with some regularity, stop.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zs->zs_enospc_count > 10)
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Pick a random function to execute.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2012-01-24 06:43:32 +04:00
|
|
|
rand = ztest_random(ZTEST_FUNCS);
|
|
|
|
zi = &ztest_info[rand];
|
|
|
|
zc = ZTEST_GET_SHARED_CALLSTATE(rand);
|
|
|
|
call_next = zc->zc_next;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (now >= call_next &&
|
2012-01-24 06:43:32 +04:00
|
|
|
atomic_cas_64(&zc->zc_next, call_next, call_next +
|
|
|
|
ztest_random(2 * zi->zi_interval[0] + 1)) == call_next) {
|
|
|
|
ztest_execute(rand, zi, id);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
thread_exit();
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
2022-04-19 21:38:30 +03:00
|
|
|
ztest_dataset_name(char *dsname, const char *pool, int d)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2016-06-16 00:28:36 +03:00
|
|
|
(void) snprintf(dsname, ZFS_MAX_DATASET_NAME_LEN, "%s/ds_%d", pool, d);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dataset_destroy(int d)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
2010-08-26 20:52:39 +04:00
|
|
|
int t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dataset_name(name, ztest_opts.zo_pool, d);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("Destroying %s to free up space\n", name);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Cleanup any non-standard clones and snapshots. In general,
|
|
|
|
* ztest thread t operates on dataset (t % zopt_datasets),
|
|
|
|
* so there may be more than one thing to clean up.
|
|
|
|
*/
|
2012-01-24 06:43:32 +04:00
|
|
|
for (t = d; t < ztest_opts.zo_threads;
|
|
|
|
t += ztest_opts.zo_datasets)
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_dsl_dataset_cleanup(name, t);
|
|
|
|
|
|
|
|
(void) dmu_objset_find(name, ztest_objset_destroy_cb, NULL,
|
|
|
|
DS_FIND_SNAPSHOTS | DS_FIND_CHILDREN);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_dataset_dirobj_verify(ztest_ds_t *zd)
|
|
|
|
{
|
|
|
|
uint64_t usedobjs, dirobjs, scratch;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ZTEST_DIROBJ is the object directory for the entire dataset.
|
|
|
|
* Therefore, the number of objects in use should equal the
|
|
|
|
* number of ZTEST_DIROBJ entries, +1 for ZTEST_DIROBJ itself.
|
|
|
|
* If not, we have an object leak.
|
|
|
|
*
|
|
|
|
* Note that we can only check this in ztest_dataset_open(),
|
|
|
|
* when the open-context and syncing-context values agree.
|
|
|
|
* That's because zap_count() returns the open-context value,
|
|
|
|
* while dmu_objset_space() returns the rootbp fill count.
|
|
|
|
*/
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(zap_count(zd->zd_os, ZTEST_DIROBJ, &dirobjs));
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_space(zd->zd_os, &scratch, &scratch, &usedobjs, &scratch);
|
|
|
|
ASSERT3U(dirobjs + 1, ==, usedobjs);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dataset_open(int d)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_ds_t *zd = &ztest_ds[d];
|
|
|
|
uint64_t committed_seq = ZTEST_GET_SHARED_DS(d)->zd_seq;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os;
|
|
|
|
zilog_t *zilog;
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
2010-05-29 00:45:14 +04:00
|
|
|
int error;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dataset_name(name, ztest_opts.zo_pool, d);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_rdlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
error = ztest_dataset_create(name);
|
|
|
|
if (error == ENOSPC) {
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_record_enospc(FTAG);
|
|
|
|
return (error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(error == 0 || error == EEXIST);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY0(ztest_dmu_objset_own(name, DMU_OST_OTHER, B_FALSE,
|
|
|
|
B_TRUE, zd, &os));
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_unlock(&ztest_name_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_zd_init(zd, ZTEST_GET_SHARED_DS(d), os);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
zilog = zd->zd_zilog;
|
|
|
|
|
|
|
|
if (zilog->zl_header->zh_claim_lr_seq != 0 &&
|
|
|
|
zilog->zl_header->zh_claim_lr_seq < committed_seq)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "missing log records: "
|
|
|
|
"claimed %"PRIu64" < committed %"PRIu64"",
|
2010-05-29 00:45:14 +04:00
|
|
|
zilog->zl_header->zh_claim_lr_seq, committed_seq);
|
|
|
|
|
|
|
|
ztest_dataset_dirobj_verify(zd);
|
|
|
|
|
|
|
|
zil_replay(os, zd, ztest_replay_vector);
|
|
|
|
|
|
|
|
ztest_dataset_dirobj_verify(zd);
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 6)
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) printf("%s replay %"PRIu64" blocks, "
|
|
|
|
"%"PRIu64" records, seq %"PRIu64"\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
zd->zd_name,
|
2021-06-05 15:18:37 +03:00
|
|
|
zilog->zl_parse_blk_count,
|
|
|
|
zilog->zl_parse_lr_count,
|
|
|
|
zilog->zl_replaying_seq);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2022-07-21 03:14:06 +03:00
|
|
|
zilog = zil_open(os, ztest_get_data, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (zilog->zl_replaying_seq != 0 &&
|
|
|
|
zilog->zl_replaying_seq < committed_seq)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_FALSE, "missing log records: "
|
|
|
|
"replayed %"PRIu64" < committed %"PRIu64"",
|
2010-05-29 00:45:14 +04:00
|
|
|
zilog->zl_replaying_seq, committed_seq);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dataset_close(int d)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_ds_t *zd = &ztest_ds[d];
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
zil_close(zd->zd_zilog);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(zd->zd_os, B_TRUE, zd);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
ztest_zd_fini(zd);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2018-11-08 02:40:24 +03:00
|
|
|
static int
|
|
|
|
ztest_replay_zil_cb(const char *name, void *arg)
|
|
|
|
{
|
2021-12-11 03:57:41 +03:00
|
|
|
(void) arg;
|
2018-11-08 02:40:24 +03:00
|
|
|
objset_t *os;
|
|
|
|
ztest_ds_t *zdtmp;
|
|
|
|
|
|
|
|
VERIFY0(ztest_dmu_objset_own(name, DMU_OST_ANY, B_TRUE,
|
|
|
|
B_TRUE, FTAG, &os));
|
|
|
|
|
|
|
|
zdtmp = umem_alloc(sizeof (ztest_ds_t), UMEM_NOFAIL);
|
|
|
|
|
|
|
|
ztest_zd_init(zdtmp, NULL, os);
|
|
|
|
zil_replay(os, zdtmp, ztest_replay_vector);
|
|
|
|
ztest_zd_fini(zdtmp);
|
|
|
|
|
|
|
|
if (dmu_objset_zil(os)->zl_parse_lr_count != 0 &&
|
|
|
|
ztest_opts.zo_verbose >= 6) {
|
|
|
|
zilog_t *zilog = dmu_objset_zil(os);
|
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) printf("%s replay %"PRIu64" blocks, "
|
|
|
|
"%"PRIu64" records, seq %"PRIu64"\n",
|
2018-11-08 02:40:24 +03:00
|
|
|
name,
|
2021-06-05 15:18:37 +03:00
|
|
|
zilog->zl_parse_blk_count,
|
|
|
|
zilog->zl_parse_lr_count,
|
|
|
|
zilog->zl_replaying_seq);
|
2018-11-08 02:40:24 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
umem_free(zdtmp, sizeof (ztest_ds_t));
|
|
|
|
|
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2020-06-06 22:51:35 +03:00
|
|
|
static void
|
|
|
|
ztest_freeze(void)
|
|
|
|
{
|
|
|
|
ztest_ds_t *zd = &ztest_ds[0];
|
|
|
|
spa_t *spa;
|
|
|
|
int numloops = 0;
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
/* freeze not supported during RAIDZ expansion */
|
|
|
|
if (ztest_opts.zo_raid_do_expand)
|
|
|
|
return;
|
|
|
|
|
2020-06-06 22:51:35 +03:00
|
|
|
if (ztest_opts.zo_verbose >= 3)
|
|
|
|
(void) printf("testing spa_freeze()...\n");
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_scratch_verify();
|
2020-06-06 22:51:35 +03:00
|
|
|
kernel_init(SPA_MODE_READ | SPA_MODE_WRITE);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
|
|
|
VERIFY0(ztest_dataset_open(0));
|
2020-06-06 22:51:35 +03:00
|
|
|
ztest_spa = spa;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Force the first log block to be transactionally allocated.
|
|
|
|
* We have to do this before we freeze the pool -- otherwise
|
|
|
|
* the log chain won't be anchored.
|
|
|
|
*/
|
|
|
|
while (BP_IS_HOLE(&zd->zd_zilog->zl_header->zh_log)) {
|
|
|
|
ztest_dmu_object_alloc_free(zd, 0);
|
|
|
|
zil_commit(zd->zd_zilog, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Freeze the pool. This stops spa_sync() from doing anything,
|
|
|
|
* so that the only way to record changes from now on is the ZIL.
|
|
|
|
*/
|
|
|
|
spa_freeze(spa);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Because it is hard to predict how much space a write will actually
|
|
|
|
* require beforehand, we leave ourselves some fudge space to write over
|
|
|
|
* capacity.
|
|
|
|
*/
|
|
|
|
uint64_t capacity = metaslab_class_get_space(spa_normal_class(spa)) / 2;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Run tests that generate log records but don't alter the pool config
|
|
|
|
* or depend on DSL sync tasks (snapshots, objset create/destroy, etc).
|
|
|
|
* We do a txg_wait_synced() after each iteration to force the txg
|
|
|
|
* to increase well beyond the last synced value in the uberblock.
|
|
|
|
* The ZIL should be OK with that.
|
|
|
|
*
|
|
|
|
* Run a random number of times less than zo_maxloops and ensure we do
|
|
|
|
* not run out of space on the pool.
|
|
|
|
*/
|
|
|
|
while (ztest_random(10) != 0 &&
|
|
|
|
numloops++ < ztest_opts.zo_maxloops &&
|
|
|
|
metaslab_class_get_alloc(spa_normal_class(spa)) < capacity) {
|
|
|
|
ztest_od_t od;
|
|
|
|
ztest_od_init(&od, 0, FTAG, 0, DMU_OT_UINT64_OTHER, 0, 0, 0);
|
|
|
|
VERIFY0(ztest_object_init(zd, &od, sizeof (od), B_FALSE));
|
|
|
|
ztest_io(zd, od.od_object,
|
|
|
|
ztest_random(ZTEST_RANGE_LOCKS) << SPA_MAXBLOCKSHIFT);
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commit all of the changes we just generated.
|
|
|
|
*/
|
|
|
|
zil_commit(zd->zd_zilog, 0);
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Close our dataset and close the pool.
|
|
|
|
*/
|
|
|
|
ztest_dataset_close(0);
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
kernel_fini();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Open and close the pool and dataset to induce log replay.
|
|
|
|
*/
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_scratch_verify();
|
2020-06-06 22:51:35 +03:00
|
|
|
kernel_init(SPA_MODE_READ | SPA_MODE_WRITE);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
|
|
|
ASSERT3U(spa_freeze_txg(spa), ==, UINT64_MAX);
|
|
|
|
VERIFY0(ztest_dataset_open(0));
|
2020-06-06 22:51:35 +03:00
|
|
|
ztest_spa = spa;
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
ztest_dataset_close(0);
|
|
|
|
ztest_reguid(NULL, 0);
|
|
|
|
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
kernel_fini();
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2021-12-11 03:57:41 +03:00
|
|
|
ztest_import_impl(void)
|
2020-06-06 22:51:35 +03:00
|
|
|
{
|
|
|
|
importargs_t args = { 0 };
|
|
|
|
nvlist_t *cfg = NULL;
|
|
|
|
int nsearch = 1;
|
|
|
|
char *searchdirs[nsearch];
|
|
|
|
int flags = ZFS_IMPORT_MISSING_LOG;
|
|
|
|
|
|
|
|
searchdirs[0] = ztest_opts.zo_dir;
|
|
|
|
args.paths = nsearch;
|
|
|
|
args.path = searchdirs;
|
|
|
|
args.can_be_active = B_FALSE;
|
|
|
|
|
2022-09-26 16:40:43 +03:00
|
|
|
libpc_handle_t lpch = {
|
|
|
|
.lpc_lib_handle = NULL,
|
|
|
|
.lpc_ops = &libzpool_config_ops,
|
|
|
|
.lpc_printerr = B_TRUE
|
|
|
|
};
|
|
|
|
VERIFY0(zpool_find_config(&lpch, ztest_opts.zo_pool, &cfg, &args));
|
2020-06-06 22:51:35 +03:00
|
|
|
VERIFY0(spa_import(ztest_opts.zo_pool, cfg, NULL, flags));
|
2020-12-23 20:52:24 +03:00
|
|
|
fnvlist_free(cfg);
|
2020-06-06 22:51:35 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Import a storage pool with the given name.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ztest_import(ztest_shared_t *zs)
|
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
|
|
|
|
mutex_init(&ztest_vdev_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_init(&ztest_checkpoint_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
VERIFY0(pthread_rwlock_init(&ztest_name_lock, NULL));
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_scratch_verify();
|
2020-06-06 22:51:35 +03:00
|
|
|
kernel_init(SPA_MODE_READ | SPA_MODE_WRITE);
|
|
|
|
|
2021-12-11 03:57:41 +03:00
|
|
|
ztest_import_impl();
|
2020-06-06 22:51:35 +03:00
|
|
|
|
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
|
|
|
zs->zs_metaslab_sz =
|
|
|
|
1ULL << spa->spa_root_vdev->vdev_child[0]->vdev_ms_shift;
|
2023-09-23 02:08:51 +03:00
|
|
|
zs->zs_guid = spa_guid(spa);
|
2020-06-06 22:51:35 +03:00
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
kernel_fini();
|
|
|
|
|
|
|
|
if (!ztest_opts.zo_mmp_test) {
|
2023-09-23 02:08:51 +03:00
|
|
|
ztest_run_zdb(zs->zs_guid);
|
2020-06-06 22:51:35 +03:00
|
|
|
ztest_freeze();
|
2023-09-23 02:08:51 +03:00
|
|
|
ztest_run_zdb(zs->zs_guid);
|
2020-06-06 22:51:35 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
(void) pthread_rwlock_destroy(&ztest_name_lock);
|
|
|
|
mutex_destroy(&ztest_vdev_lock);
|
|
|
|
mutex_destroy(&ztest_checkpoint_lock);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
* After the expansion was killed, check that the pool is healthy
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ztest_raidz_expand_check(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT3U(ztest_opts.zo_raidz_expand_test, ==, RAIDZ_EXPAND_KILLED);
|
|
|
|
/*
|
|
|
|
* Set pool check done flag, main program will run a zdb check
|
|
|
|
* of the pool when we exit.
|
|
|
|
*/
|
|
|
|
ztest_shared_opts->zo_raidz_expand_test = RAIDZ_EXPAND_CHECKED;
|
|
|
|
|
|
|
|
/* Wait for reflow to finish */
|
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("\nwaiting for reflow to finish ...\n");
|
|
|
|
}
|
|
|
|
pool_raidz_expand_stat_t rzx_stats;
|
|
|
|
pool_raidz_expand_stat_t *pres = &rzx_stats;
|
|
|
|
do {
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
(void) poll(NULL, 0, 500); /* wait 1/2 second */
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
(void) spa_raidz_expand_get_stats(spa, pres);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
} while (pres->pres_state != DSS_FINISHED &&
|
|
|
|
pres->pres_reflowed < pres->pres_to_reflow);
|
|
|
|
|
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("verifying an interrupted raidz "
|
|
|
|
"expansion using a pool scrub ...\n");
|
|
|
|
}
|
2024-10-09 06:41:17 +03:00
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
/* Will fail here if there is non-recoverable corruption detected */
|
2024-10-09 06:41:17 +03:00
|
|
|
int error = ztest_scrub_impl(spa);
|
|
|
|
if (error == EBUSY)
|
|
|
|
error = 0;
|
|
|
|
|
|
|
|
VERIFY0(error);
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("raidz expansion scrub check complete\n");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Start a raidz expansion test. We run some I/O on the pool for a while
|
|
|
|
* to get some data in the pool. Then we grow the raidz and
|
|
|
|
* kill the test at the requested offset into the reflow, verifying that
|
|
|
|
* doing such does not lead to pool corruption.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ztest_raidz_expand_run(ztest_shared_t *zs, spa_t *spa)
|
|
|
|
{
|
|
|
|
nvlist_t *root;
|
|
|
|
pool_raidz_expand_stat_t rzx_stats;
|
|
|
|
pool_raidz_expand_stat_t *pres = &rzx_stats;
|
|
|
|
kthread_t **run_threads;
|
|
|
|
vdev_t *cvd, *rzvd = spa->spa_root_vdev->vdev_child[0];
|
|
|
|
int total_disks = rzvd->vdev_children;
|
|
|
|
int data_disks = total_disks - vdev_get_nparity(rzvd);
|
|
|
|
uint64_t alloc_goal;
|
|
|
|
uint64_t csize;
|
|
|
|
int error, t;
|
|
|
|
int threads = ztest_opts.zo_threads;
|
|
|
|
ztest_expand_io_t *thread_args;
|
|
|
|
|
|
|
|
ASSERT3U(ztest_opts.zo_raidz_expand_test, !=, RAIDZ_EXPAND_NONE);
|
2024-04-22 20:48:58 +03:00
|
|
|
ASSERT3P(rzvd->vdev_ops, ==, &vdev_raidz_ops);
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_opts.zo_raidz_expand_test = RAIDZ_EXPAND_STARTED;
|
|
|
|
|
|
|
|
/* Setup a 1 MiB buffer of random data */
|
|
|
|
uint64_t bufsize = 1024 * 1024;
|
|
|
|
void *buffer = umem_alloc(bufsize, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
if (read(ztest_fd_rand, buffer, bufsize) != bufsize) {
|
|
|
|
fatal(B_TRUE, "short read from /dev/urandom");
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Put some data in the pool and then attach a vdev to initiate
|
|
|
|
* reflow.
|
|
|
|
*/
|
|
|
|
run_threads = umem_zalloc(threads * sizeof (kthread_t *), UMEM_NOFAIL);
|
|
|
|
thread_args = umem_zalloc(threads * sizeof (ztest_expand_io_t),
|
|
|
|
UMEM_NOFAIL);
|
|
|
|
/* Aim for roughly 25% of allocatable space up to 1GB */
|
|
|
|
alloc_goal = (vdev_get_min_asize(rzvd) * data_disks) / total_disks;
|
|
|
|
alloc_goal = MIN(alloc_goal >> 2, 1024*1024*1024);
|
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("adding data to pool '%s', goal %llu bytes\n",
|
|
|
|
ztest_opts.zo_pool, (u_longlong_t)alloc_goal);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Kick off all the I/O generators that run in parallel.
|
|
|
|
*/
|
|
|
|
for (t = 0; t < threads; t++) {
|
|
|
|
if (t < ztest_opts.zo_datasets && ztest_dataset_open(t) != 0) {
|
|
|
|
umem_free(run_threads, threads * sizeof (kthread_t *));
|
|
|
|
umem_free(buffer, bufsize);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
thread_args[t].rzx_id = t;
|
|
|
|
thread_args[t].rzx_amount = alloc_goal / threads;
|
|
|
|
thread_args[t].rzx_bufsize = bufsize;
|
|
|
|
thread_args[t].rzx_buffer = buffer;
|
|
|
|
thread_args[t].rzx_alloc_max = alloc_goal;
|
|
|
|
thread_args[t].rzx_spa = spa;
|
|
|
|
run_threads[t] = thread_create(NULL, 0, ztest_rzx_thread,
|
|
|
|
&thread_args[t], 0, NULL, TS_RUN | TS_JOINABLE,
|
|
|
|
defclsyspri);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait for all of the writers to complete.
|
|
|
|
*/
|
|
|
|
for (t = 0; t < threads; t++)
|
|
|
|
VERIFY0(thread_join(run_threads[t]));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Close all datasets. This must be done after all the threads
|
|
|
|
* are joined so we can be sure none of the datasets are in-use
|
|
|
|
* by any of the threads.
|
|
|
|
*/
|
|
|
|
for (t = 0; t < ztest_opts.zo_threads; t++) {
|
|
|
|
if (t < ztest_opts.zo_datasets)
|
|
|
|
ztest_dataset_close(t);
|
|
|
|
}
|
|
|
|
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
|
|
|
zs->zs_alloc = metaslab_class_get_alloc(spa_normal_class(spa));
|
|
|
|
zs->zs_space = metaslab_class_get_space(spa_normal_class(spa));
|
|
|
|
|
|
|
|
umem_free(buffer, bufsize);
|
|
|
|
umem_free(run_threads, threads * sizeof (kthread_t *));
|
|
|
|
umem_free(thread_args, threads * sizeof (ztest_expand_io_t));
|
|
|
|
|
|
|
|
/* Set our reflow target to 25%, 50% or 75% of allocated size */
|
|
|
|
uint_t multiple = ztest_random(3) + 1;
|
|
|
|
uint64_t reflow_max = (rzvd->vdev_stat.vs_alloc * multiple) / 4;
|
|
|
|
raidz_expand_max_reflow_bytes = reflow_max;
|
|
|
|
|
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("running raidz expansion test, killing when "
|
|
|
|
"reflow reaches %llu bytes (%u/4 of allocated space)\n",
|
|
|
|
(u_longlong_t)reflow_max, multiple);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* XXX - do we want some I/O load during the reflow? */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Use a disk size that is larger than existing ones
|
|
|
|
*/
|
|
|
|
cvd = rzvd->vdev_child[0];
|
|
|
|
csize = vdev_get_min_asize(cvd);
|
|
|
|
csize += csize / 10;
|
|
|
|
/*
|
|
|
|
* Path to vdev to be attached
|
|
|
|
*/
|
|
|
|
char *newpath = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
(void) snprintf(newpath, MAXPATHLEN, ztest_dev_template,
|
|
|
|
ztest_opts.zo_dir, ztest_opts.zo_pool, rzvd->vdev_children);
|
|
|
|
/*
|
|
|
|
* Build the nvlist describing newpath.
|
|
|
|
*/
|
|
|
|
root = make_vdev_root(newpath, NULL, NULL, csize, ztest_get_ashift(),
|
|
|
|
NULL, 0, 0, 1);
|
|
|
|
/*
|
|
|
|
* Expand the raidz vdev by attaching the new disk
|
|
|
|
*/
|
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("expanding raidz: %d wide to %d wide with '%s'\n",
|
|
|
|
(int)rzvd->vdev_children, (int)rzvd->vdev_children + 1,
|
|
|
|
newpath);
|
|
|
|
}
|
|
|
|
error = spa_vdev_attach(spa, rzvd->vdev_guid, root, B_FALSE, B_FALSE);
|
|
|
|
nvlist_free(root);
|
|
|
|
if (error != 0) {
|
|
|
|
fatal(0, "raidz expand: attach (%s %llu) returned %d",
|
|
|
|
newpath, (long long)csize, error);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait for reflow to begin
|
|
|
|
*/
|
|
|
|
while (spa->spa_raidz_expand == NULL) {
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
(void) poll(NULL, 0, 100); /* wait 1/10 second */
|
|
|
|
}
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
(void) spa_raidz_expand_get_stats(spa, pres);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
while (pres->pres_state != DSS_SCANNING) {
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
(void) poll(NULL, 0, 100); /* wait 1/10 second */
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
(void) spa_raidz_expand_get_stats(spa, pres);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT3U(pres->pres_state, ==, DSS_SCANNING);
|
|
|
|
ASSERT3U(pres->pres_to_reflow, !=, 0);
|
|
|
|
/*
|
|
|
|
* Set so when we are killed we go to raidz checking rather than
|
|
|
|
* restarting test.
|
|
|
|
*/
|
|
|
|
ztest_shared_opts->zo_raidz_expand_test = RAIDZ_EXPAND_KILLED;
|
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("raidz expansion reflow started, waiting for "
|
|
|
|
"%llu bytes to be copied\n", (u_longlong_t)reflow_max);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait for reflow maximum to be reached and then kill the test
|
|
|
|
*/
|
|
|
|
while (pres->pres_reflowed < reflow_max) {
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
(void) poll(NULL, 0, 100); /* wait 1/10 second */
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
(void) spa_raidz_expand_get_stats(spa, pres);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Reset the reflow pause before killing */
|
|
|
|
raidz_expand_max_reflow_bytes = 0;
|
|
|
|
|
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("killing raidz expansion test after reflow "
|
|
|
|
"reached %llu bytes\n", (u_longlong_t)pres->pres_reflowed);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Kill ourself to simulate a panic during a reflow. Our parent will
|
|
|
|
* restart the test and the changed flag value will drive the test
|
|
|
|
* through the scrub/check code to verify the pool is not corrupted.
|
|
|
|
*/
|
|
|
|
ztest_kill(zs);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_generic_run(ztest_shared_t *zs, spa_t *spa)
|
|
|
|
{
|
|
|
|
kthread_t **run_threads;
|
|
|
|
int t;
|
|
|
|
|
|
|
|
run_threads = umem_zalloc(ztest_opts.zo_threads * sizeof (kthread_t *),
|
|
|
|
UMEM_NOFAIL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Kick off all the tests that run in parallel.
|
|
|
|
*/
|
|
|
|
for (t = 0; t < ztest_opts.zo_threads; t++) {
|
|
|
|
if (t < ztest_opts.zo_datasets && ztest_dataset_open(t) != 0) {
|
|
|
|
umem_free(run_threads, ztest_opts.zo_threads *
|
|
|
|
sizeof (kthread_t *));
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
run_threads[t] = thread_create(NULL, 0, ztest_thread,
|
|
|
|
(void *)(uintptr_t)t, 0, NULL, TS_RUN | TS_JOINABLE,
|
|
|
|
defclsyspri);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait for all of the tests to complete.
|
|
|
|
*/
|
|
|
|
for (t = 0; t < ztest_opts.zo_threads; t++)
|
|
|
|
VERIFY0(thread_join(run_threads[t]));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Close all datasets. This must be done after all the threads
|
|
|
|
* are joined so we can be sure none of the datasets are in-use
|
|
|
|
* by any of the threads.
|
|
|
|
*/
|
|
|
|
for (t = 0; t < ztest_opts.zo_threads; t++) {
|
|
|
|
if (t < ztest_opts.zo_datasets)
|
|
|
|
ztest_dataset_close(t);
|
|
|
|
}
|
|
|
|
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
|
|
|
zs->zs_alloc = metaslab_class_get_alloc(spa_normal_class(spa));
|
|
|
|
zs->zs_space = metaslab_class_get_space(spa_normal_class(spa));
|
|
|
|
|
|
|
|
umem_free(run_threads, ztest_opts.zo_threads * sizeof (kthread_t *));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Setup our test context and kick off threads to run tests on all datasets
|
|
|
|
* in parallel.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_run(ztest_shared_t *zs)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
spa_t *spa;
|
2011-11-12 02:07:54 +04:00
|
|
|
objset_t *os;
|
2018-10-10 23:48:33 +03:00
|
|
|
kthread_t *resume_thread, *deadman_thread;
|
2010-08-26 21:43:27 +04:00
|
|
|
uint64_t object;
|
2010-05-29 00:45:14 +04:00
|
|
|
int error;
|
2010-08-26 20:52:39 +04:00
|
|
|
int t, d;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
ztest_exiting = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Initialize parent/child shared state.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_init(&ztest_vdev_lock, NULL, MUTEX_DEFAULT, NULL);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_init(&ztest_checkpoint_lock, NULL, MUTEX_DEFAULT, NULL);
|
2018-06-05 02:52:10 +03:00
|
|
|
VERIFY0(pthread_rwlock_init(&ztest_name_lock, NULL));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_thread_start = gethrtime();
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_thread_stop =
|
|
|
|
zs->zs_thread_start + ztest_opts.zo_passtime * NANOSEC;
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_thread_stop = MIN(zs->zs_thread_stop, zs->zs_proc_stop);
|
|
|
|
zs->zs_thread_kill = zs->zs_thread_stop;
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_random(100) < ztest_opts.zo_killrate) {
|
|
|
|
zs->zs_thread_kill -=
|
|
|
|
ztest_random(ztest_opts.zo_passtime * NANOSEC);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 21:43:27 +04:00
|
|
|
mutex_init(&zcl.zcl_callbacks_lock, NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
list_create(&zcl.zcl_callbacks, sizeof (ztest_cb_data_t),
|
|
|
|
offsetof(ztest_cb_data_t, zcd_node));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2020-06-06 22:51:35 +03:00
|
|
|
* Open our pool. It may need to be imported first depending on
|
|
|
|
* what tests were running when the previous pass was terminated.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_scratch_verify();
|
2019-11-21 20:32:57 +03:00
|
|
|
kernel_init(SPA_MODE_READ | SPA_MODE_WRITE);
|
2020-06-06 22:51:35 +03:00
|
|
|
error = spa_open(ztest_opts.zo_pool, &spa, FTAG);
|
|
|
|
if (error) {
|
|
|
|
VERIFY3S(error, ==, ENOENT);
|
2021-12-11 03:57:41 +03:00
|
|
|
ztest_import_impl();
|
2020-06-06 22:51:35 +03:00
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
|
|
|
zs->zs_metaslab_sz =
|
|
|
|
1ULL << spa->spa_root_vdev->vdev_child[0]->vdev_ms_shift;
|
|
|
|
}
|
|
|
|
|
2014-06-13 03:29:11 +04:00
|
|
|
metaslab_preload_limit = ztest_random(20) + 1;
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_spa = spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
/*
|
|
|
|
* XXX - BUGBUG raidz expansion do not run this for generic for now
|
|
|
|
*/
|
|
|
|
if (ztest_opts.zo_raidz_expand_test != RAIDZ_EXPAND_NONE)
|
|
|
|
VERIFY0(vdev_raidz_impl_set("cycle"));
|
2019-07-12 19:31:20 +03:00
|
|
|
|
2017-01-26 23:30:43 +03:00
|
|
|
dmu_objset_stats_t dds;
|
2017-09-12 23:15:11 +03:00
|
|
|
VERIFY0(ztest_dmu_objset_own(ztest_opts.zo_pool,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
DMU_OST_ANY, B_TRUE, B_TRUE, FTAG, &os));
|
2017-01-26 23:30:43 +03:00
|
|
|
dsl_pool_config_enter(dmu_objset_pool(os), FTAG);
|
|
|
|
dmu_objset_fast_stat(os, &dds);
|
|
|
|
dsl_pool_config_exit(dmu_objset_pool(os), FTAG);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, FTAG);
|
2011-11-12 02:07:54 +04:00
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
/* Give the dedicated raidz expansion test more grace time */
|
|
|
|
if (ztest_opts.zo_raidz_expand_test != RAIDZ_EXPAND_NONE)
|
|
|
|
zfs_deadman_synctime_ms *= 2;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Create a thread to periodically resume suspended I/O.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
resume_thread = thread_create(NULL, 0, ztest_resume_thread,
|
|
|
|
spa, 0, NULL, TS_RUN | TS_JOINABLE, defclsyspri);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2017-12-19 01:06:07 +03:00
|
|
|
* Create a deadman thread and set to panic if we hang.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2018-10-10 23:48:33 +03:00
|
|
|
deadman_thread = thread_create(NULL, 0, ztest_deadman_thread,
|
2017-12-19 01:06:07 +03:00
|
|
|
zs, 0, NULL, TS_RUN | TS_JOINABLE, defclsyspri);
|
|
|
|
|
|
|
|
spa->spa_deadman_failmode = ZIO_FAILURE_MODE_PANIC;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2016-12-17 01:11:29 +03:00
|
|
|
* Verify that we can safely inquire about any object,
|
2008-11-20 23:01:55 +03:00
|
|
|
* whether it's allocated or not. To make it interesting,
|
|
|
|
* we probe a 5-wide window around each power of two.
|
|
|
|
* This hits all edge cases, including zero and the max.
|
|
|
|
*/
|
2010-08-26 20:52:39 +04:00
|
|
|
for (t = 0; t < 64; t++) {
|
|
|
|
for (d = -5; d <= 5; d++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
error = dmu_object_info(spa->spa_meta_objset,
|
|
|
|
(1ULL << t) + d, NULL);
|
|
|
|
ASSERT(error == 0 || error == ENOENT ||
|
|
|
|
error == EINVAL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* If we got any ENOSPC errors on the previous run, destroy something.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zs->zs_enospc_count != 0) {
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
/* Not expecting ENOSPC errors during raidz expansion tests */
|
|
|
|
ASSERT3U(ztest_opts.zo_raidz_expand_test, ==,
|
|
|
|
RAIDZ_EXPAND_NONE);
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
int d = ztest_random(ztest_opts.zo_datasets);
|
|
|
|
ztest_dataset_destroy(d);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
zs->zs_enospc_count = 0;
|
|
|
|
|
2018-11-29 07:47:09 +03:00
|
|
|
/*
|
|
|
|
* If we were in the middle of ztest_device_removal() and were killed
|
|
|
|
* we need to ensure the removal and scrub complete before running
|
|
|
|
* any tests that check ztest_device_removal_active. The removal will
|
|
|
|
* be restarted automatically when the spa is opened, but we need to
|
2019-01-13 21:11:52 +03:00
|
|
|
* initiate the scrub manually if it is not already in progress. Note
|
2018-11-29 07:47:09 +03:00
|
|
|
* that we always run the scrub whenever an indirect vdev exists
|
|
|
|
* because we have no way of knowing for sure if ztest_device_removal()
|
|
|
|
* fully completed its scrub before the pool was reimported.
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
*
|
|
|
|
* Does not apply for the RAIDZ expansion specific test runs
|
2018-11-29 07:47:09 +03:00
|
|
|
*/
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
if (ztest_opts.zo_raidz_expand_test == RAIDZ_EXPAND_NONE &&
|
|
|
|
(spa->spa_removing_phys.sr_state == DSS_SCANNING ||
|
|
|
|
spa->spa_removing_phys.sr_prev_indirect_vdev != -1)) {
|
2018-11-29 07:47:09 +03:00
|
|
|
while (spa->spa_removing_phys.sr_state == DSS_SCANNING)
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
|
2019-01-18 20:47:55 +03:00
|
|
|
error = ztest_scrub_impl(spa);
|
|
|
|
if (error == EBUSY)
|
|
|
|
error = 0;
|
|
|
|
ASSERT0(error);
|
2018-11-29 07:47:09 +03:00
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 4)
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("starting main threads...\n");
|
|
|
|
|
2018-11-08 02:40:24 +03:00
|
|
|
/*
|
|
|
|
* Replay all logs of all datasets in the pool. This is primarily for
|
|
|
|
* temporary datasets which wouldn't otherwise get replayed, which
|
|
|
|
* can trigger failures when attempting to offline a SLOG in
|
|
|
|
* ztest_fault_inject().
|
|
|
|
*/
|
|
|
|
(void) dmu_objset_find(ztest_opts.zo_pool, ztest_replay_zil_cb,
|
|
|
|
NULL, DS_FIND_CHILDREN);
|
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
if (ztest_opts.zo_raidz_expand_test == RAIDZ_EXPAND_REQUESTED)
|
|
|
|
ztest_raidz_expand_run(zs, spa);
|
|
|
|
else if (ztest_opts.zo_raidz_expand_test == RAIDZ_EXPAND_KILLED)
|
|
|
|
ztest_raidz_expand_check(spa);
|
|
|
|
else
|
|
|
|
ztest_generic_run(zs, spa);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-10-10 23:48:33 +03:00
|
|
|
/* Kill the resume and deadman threads */
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_exiting = B_TRUE;
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
VERIFY0(thread_join(resume_thread));
|
2018-10-10 23:48:33 +03:00
|
|
|
VERIFY0(thread_join(deadman_thread));
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_resume(spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Right before closing the pool, kick off a bunch of async I/O;
|
|
|
|
* spa_close() should wait for it to complete.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-12-22 04:31:57 +03:00
|
|
|
for (object = 1; object < 50; object++) {
|
|
|
|
dmu_prefetch(spa->spa_meta_objset, object, 0, 0, 1ULL << 20,
|
|
|
|
ZIO_PRIORITY_SYNC_READ);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 21:17:18 +04:00
|
|
|
/* Verify that at least one commit cb was called in a timely fashion */
|
|
|
|
if (zc_cb_counter >= ZTEST_COMMIT_CB_MIN_REG)
|
2013-05-11 01:17:03 +04:00
|
|
|
VERIFY0(zc_min_txg_delay);
|
2010-08-26 21:17:18 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can loop over all pools.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
for (spa = spa_next(NULL); spa != NULL; spa = spa_next(spa))
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose > 3)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("spa_next: found %s\n", spa_name(spa));
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we can export the pool and reimport it under a
|
|
|
|
* different name.
|
|
|
|
*/
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if ((ztest_random(2) == 0) && !ztest_opts.zo_mmp_test) {
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
|
|
|
(void) snprintf(name, sizeof (name), "%s_import",
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_pool);
|
|
|
|
ztest_spa_import_export(ztest_opts.zo_pool, name);
|
|
|
|
ztest_spa_import_export(name, ztest_opts.zo_pool);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
kernel_fini();
|
2010-08-27 01:24:34 +04:00
|
|
|
|
|
|
|
list_destroy(&zcl.zcl_callbacks);
|
2010-08-26 22:59:11 +04:00
|
|
|
mutex_destroy(&zcl.zcl_callbacks_lock);
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_destroy(&ztest_name_lock);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_destroy(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_destroy(&ztest_checkpoint_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2008-11-20 23:01:55 +03:00
|
|
|
print_time(hrtime_t t, char *timebuf)
|
|
|
|
{
|
|
|
|
hrtime_t s = t / NANOSEC;
|
|
|
|
hrtime_t m = s / 60;
|
|
|
|
hrtime_t h = m / 60;
|
|
|
|
hrtime_t d = h / 24;
|
|
|
|
|
|
|
|
s -= m * 60;
|
|
|
|
m -= h * 60;
|
|
|
|
h -= d * 24;
|
|
|
|
|
|
|
|
timebuf[0] = '\0';
|
|
|
|
|
|
|
|
if (d)
|
|
|
|
(void) sprintf(timebuf,
|
|
|
|
"%llud%02lluh%02llum%02llus", d, h, m, s);
|
|
|
|
else if (h)
|
|
|
|
(void) sprintf(timebuf, "%lluh%02llum%02llus", h, m, s);
|
|
|
|
else if (m)
|
|
|
|
(void) sprintf(timebuf, "%llum%02llus", m, s);
|
|
|
|
else
|
|
|
|
(void) sprintf(timebuf, "%llus", s);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static nvlist_t *
|
ddt: dedup table quota enforcement
This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool
When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (ie their
refcounts changed), as that reuses the space rather than requiring more.
dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.
Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Closes #15889
2024-07-25 19:47:36 +03:00
|
|
|
make_random_pool_props(void)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
nvlist_t *props;
|
|
|
|
|
2021-01-11 20:30:33 +03:00
|
|
|
props = fnvlist_alloc();
|
2018-09-06 04:33:36 +03:00
|
|
|
|
ddt: dedup table quota enforcement
This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool
When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (ie their
refcounts changed), as that reuses the space rather than requiring more.
dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.
Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Closes #15889
2024-07-25 19:47:36 +03:00
|
|
|
/* Twenty percent of the time enable ZPOOL_PROP_DEDUP_TABLE_QUOTA */
|
|
|
|
if (ztest_random(5) == 0) {
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_DEDUP_TABLE_QUOTA),
|
|
|
|
2 * 1024 * 1024);
|
|
|
|
}
|
2018-02-05 23:00:26 +03:00
|
|
|
|
ddt: dedup table quota enforcement
This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool
When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (ie their
refcounts changed), as that reuses the space rather than requiring more.
dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.
Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Closes #15889
2024-07-25 19:47:36 +03:00
|
|
|
/* Fifty percent of the time enable ZPOOL_PROP_AUTOREPLACE */
|
|
|
|
if (ztest_random(2) == 0) {
|
|
|
|
fnvlist_add_uint64(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_AUTOREPLACE), 1);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
return (props);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Create a storage pool with the given name and initial vdev size.
|
2010-05-29 00:45:14 +04:00
|
|
|
* Then test spa_freeze() functionality.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_init(ztest_shared_t *zs)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
spa_t *spa;
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *nvroot, *props;
|
2012-12-14 03:24:15 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_init(&ztest_vdev_lock, NULL, MUTEX_DEFAULT, NULL);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_init(&ztest_checkpoint_lock, NULL, MUTEX_DEFAULT, NULL);
|
2018-06-05 02:52:10 +03:00
|
|
|
VERIFY0(pthread_rwlock_init(&ztest_name_lock, NULL));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
raidz_scratch_verify();
|
2019-11-21 20:32:57 +03:00
|
|
|
kernel_init(SPA_MODE_READ | SPA_MODE_WRITE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Create the storage pool.
|
|
|
|
*/
|
2012-01-24 06:43:32 +04:00
|
|
|
(void) spa_destroy(ztest_opts.zo_pool);
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_shared->zs_vdev_next_leaf = 0;
|
|
|
|
zs->zs_splits = 0;
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_mirrors = ztest_opts.zo_mirrors;
|
2012-12-15 04:28:49 +04:00
|
|
|
nvroot = make_vdev_root(NULL, NULL, NULL, ztest_opts.zo_vdev_size, 0,
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
NULL, ztest_opts.zo_raid_children, zs->zs_mirrors, 1);
|
ddt: dedup table quota enforcement
This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool
When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (ie their
refcounts changed), as that reuses the space rather than requiring more.
dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.
Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Closes #15889
2024-07-25 19:47:36 +03:00
|
|
|
props = make_random_pool_props();
|
2018-02-05 23:00:26 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't expect the pool to suspend unless maxfaults == 0,
|
|
|
|
* in which case ztest_fault_inject() temporarily takes away
|
|
|
|
* the only valid replica.
|
|
|
|
*/
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_add_uint64(props,
|
2018-02-05 23:00:26 +03:00
|
|
|
zpool_prop_to_name(ZPOOL_PROP_FAILUREMODE),
|
2021-01-11 20:30:33 +03:00
|
|
|
MAXFAULTS(zs) ? ZIO_FAILURE_MODE_PANIC : ZIO_FAILURE_MODE_WAIT);
|
2018-02-05 23:00:26 +03:00
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
for (i = 0; i < SPA_FEATURES; i++) {
|
|
|
|
char *buf;
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
|
2021-02-28 04:16:02 +03:00
|
|
|
if (!spa_feature_table[i].fi_zfs_mod_supported)
|
|
|
|
continue;
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
/*
|
|
|
|
* 75% chance of using the log space map feature. We want ztest
|
|
|
|
* to exercise both the code paths that use the log space map
|
|
|
|
* feature and the ones that don't.
|
|
|
|
*/
|
|
|
|
if (i == SPA_FEATURE_LOG_SPACEMAP && ztest_random(4) == 0)
|
2024-06-18 01:35:18 +03:00
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* split 50/50 between legacy and fast dedup
|
|
|
|
*/
|
|
|
|
if (i == SPA_FEATURE_FAST_DEDUP && ztest_random(2) != 0)
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
continue;
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
VERIFY3S(-1, !=, asprintf(&buf, "feature@%s",
|
|
|
|
spa_feature_table[i].fi_uname));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_add_uint64(props, buf, 0);
|
2012-12-14 03:24:15 +04:00
|
|
|
free(buf);
|
|
|
|
}
|
2018-02-05 23:00:26 +03:00
|
|
|
|
|
|
|
VERIFY0(spa_create(ztest_opts.zo_pool, nvroot, props, NULL, NULL));
|
2021-01-11 20:30:33 +03:00
|
|
|
fnvlist_free(nvroot);
|
|
|
|
fnvlist_free(props);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(spa_open(ztest_opts.zo_pool, &spa, FTAG));
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_metaslab_sz =
|
|
|
|
1ULL << spa->spa_root_vdev->vdev_child[0]->vdev_ms_shift;
|
2023-09-23 02:08:51 +03:00
|
|
|
zs->zs_guid = spa_guid(spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
kernel_fini();
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (!ztest_opts.zo_mmp_test) {
|
2023-09-23 02:08:51 +03:00
|
|
|
ztest_run_zdb(zs->zs_guid);
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
ztest_freeze();
|
2023-09-23 02:08:51 +03:00
|
|
|
ztest_run_zdb(zs->zs_guid);
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
}
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2018-06-05 02:52:10 +03:00
|
|
|
(void) pthread_rwlock_destroy(&ztest_name_lock);
|
2012-01-24 06:43:32 +04:00
|
|
|
mutex_destroy(&ztest_vdev_lock);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_destroy(&ztest_checkpoint_lock);
|
2012-01-24 06:43:32 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2012-10-04 18:09:12 +04:00
|
|
|
setup_data_fd(void)
|
2012-01-24 06:43:32 +04:00
|
|
|
{
|
2012-10-04 22:36:52 +04:00
|
|
|
static char ztest_name_data[] = "/tmp/ztest.data.XXXXXX";
|
|
|
|
|
|
|
|
ztest_fd_data = mkstemp(ztest_name_data);
|
2012-10-04 18:09:12 +04:00
|
|
|
ASSERT3S(ztest_fd_data, >=, 0);
|
2012-10-04 22:36:52 +04:00
|
|
|
(void) unlink(ztest_name_data);
|
2012-01-24 06:43:32 +04:00
|
|
|
}
|
|
|
|
|
2012-05-21 23:11:39 +04:00
|
|
|
static int
|
|
|
|
shared_data_size(ztest_shared_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
int size;
|
|
|
|
|
|
|
|
size = hdr->zh_hdr_size;
|
|
|
|
size += hdr->zh_opts_size;
|
|
|
|
size += hdr->zh_size;
|
|
|
|
size += hdr->zh_stats_size * hdr->zh_stats_count;
|
|
|
|
size += hdr->zh_ds_size * hdr->zh_ds_count;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
size += hdr->zh_scratch_state_size;
|
2012-05-21 23:11:39 +04:00
|
|
|
|
|
|
|
return (size);
|
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
static void
|
|
|
|
setup_hdr(void)
|
|
|
|
{
|
2012-05-21 23:11:39 +04:00
|
|
|
int size;
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_hdr_t *hdr;
|
|
|
|
|
|
|
|
hdr = (void *)mmap(0, P2ROUNDUP(sizeof (*hdr), getpagesize()),
|
2012-10-04 18:09:12 +04:00
|
|
|
PROT_READ | PROT_WRITE, MAP_SHARED, ztest_fd_data, 0);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(hdr, !=, MAP_FAILED);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(ftruncate(ztest_fd_data, sizeof (ztest_shared_hdr_t)));
|
2012-05-21 23:11:39 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
hdr->zh_hdr_size = sizeof (ztest_shared_hdr_t);
|
|
|
|
hdr->zh_opts_size = sizeof (ztest_shared_opts_t);
|
|
|
|
hdr->zh_size = sizeof (ztest_shared_t);
|
|
|
|
hdr->zh_stats_size = sizeof (ztest_shared_callstate_t);
|
|
|
|
hdr->zh_stats_count = ZTEST_FUNCS;
|
|
|
|
hdr->zh_ds_size = sizeof (ztest_shared_ds_t);
|
|
|
|
hdr->zh_ds_count = ztest_opts.zo_datasets;
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
hdr->zh_scratch_state_size = sizeof (ztest_shared_scratch_state_t);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2012-05-21 23:11:39 +04:00
|
|
|
size = shared_data_size(hdr);
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY0(ftruncate(ztest_fd_data, size));
|
2012-05-21 23:11:39 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
(void) munmap((caddr_t)hdr, P2ROUNDUP(sizeof (*hdr), getpagesize()));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
setup_data(void)
|
|
|
|
{
|
|
|
|
int size, offset;
|
|
|
|
ztest_shared_hdr_t *hdr;
|
|
|
|
uint8_t *buf;
|
|
|
|
|
|
|
|
hdr = (void *)mmap(0, P2ROUNDUP(sizeof (*hdr), getpagesize()),
|
2012-10-04 18:09:12 +04:00
|
|
|
PROT_READ, MAP_SHARED, ztest_fd_data, 0);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(hdr, !=, MAP_FAILED);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2012-05-21 23:11:39 +04:00
|
|
|
size = shared_data_size(hdr);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
(void) munmap((caddr_t)hdr, P2ROUNDUP(sizeof (*hdr), getpagesize()));
|
|
|
|
hdr = ztest_shared_hdr = (void *)mmap(0, P2ROUNDUP(size, getpagesize()),
|
2012-10-04 18:09:12 +04:00
|
|
|
PROT_READ | PROT_WRITE, MAP_SHARED, ztest_fd_data, 0);
|
2021-01-13 04:21:01 +03:00
|
|
|
ASSERT3P(hdr, !=, MAP_FAILED);
|
2012-01-24 06:43:32 +04:00
|
|
|
buf = (uint8_t *)hdr;
|
|
|
|
|
|
|
|
offset = hdr->zh_hdr_size;
|
|
|
|
ztest_shared_opts = (void *)&buf[offset];
|
|
|
|
offset += hdr->zh_opts_size;
|
|
|
|
ztest_shared = (void *)&buf[offset];
|
|
|
|
offset += hdr->zh_size;
|
|
|
|
ztest_shared_callstate = (void *)&buf[offset];
|
|
|
|
offset += hdr->zh_stats_size * hdr->zh_stats_count;
|
|
|
|
ztest_shared_ds = (void *)&buf[offset];
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
offset += hdr->zh_ds_size * hdr->zh_ds_count;
|
|
|
|
ztest_scratch_state = (void *)&buf[offset];
|
2012-01-24 06:43:32 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static boolean_t
|
|
|
|
exec_child(char *cmd, char *libpath, boolean_t ignorekill, int *statusp)
|
|
|
|
{
|
|
|
|
pid_t pid;
|
|
|
|
int status;
|
2012-10-04 22:14:04 +04:00
|
|
|
char *cmdbuf = NULL;
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
pid = fork();
|
|
|
|
|
|
|
|
if (cmd == NULL) {
|
2012-10-04 22:14:04 +04:00
|
|
|
cmdbuf = umem_alloc(MAXPATHLEN, UMEM_NOFAIL);
|
|
|
|
(void) strlcpy(cmdbuf, getexecname(), MAXPATHLEN);
|
2012-01-24 06:43:32 +04:00
|
|
|
cmd = cmdbuf;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (pid == -1)
|
2021-06-05 15:18:37 +03:00
|
|
|
fatal(B_TRUE, "fork failed");
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
if (pid == 0) { /* child */
|
2012-10-04 18:09:12 +04:00
|
|
|
char fd_data_str[12];
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3S(11, >=,
|
|
|
|
snprintf(fd_data_str, 12, "%d", ztest_fd_data));
|
|
|
|
VERIFY0(setenv("ZTEST_FD_DATA", fd_data_str, 1));
|
2012-10-04 18:09:12 +04:00
|
|
|
|
2022-05-03 17:55:14 +03:00
|
|
|
if (libpath != NULL) {
|
|
|
|
const char *curlp = getenv("LD_LIBRARY_PATH");
|
|
|
|
if (curlp == NULL)
|
|
|
|
VERIFY0(setenv("LD_LIBRARY_PATH", libpath, 1));
|
|
|
|
else {
|
|
|
|
char *newlp = NULL;
|
|
|
|
VERIFY3S(-1, !=,
|
|
|
|
asprintf(&newlp, "%s:%s", libpath, curlp));
|
|
|
|
VERIFY0(setenv("LD_LIBRARY_PATH", newlp, 1));
|
2022-09-14 02:53:21 +03:00
|
|
|
free(newlp);
|
2022-05-03 17:55:14 +03:00
|
|
|
}
|
|
|
|
}
|
Remove enable_extended_FILE_stdio()
Even on Illumos it's only available in the 32-bit programming
environment, and, quoth enable_extended_FILE_stdio(3C):
> Historically, 32-bit Solaris applications have been limited to using
> only the file descriptors 0 through 255 with the standard I/O
> functions (see stdio(3C)) in the C library. The extended FILE
> facility allows well-behaved 32-bit applications to use any
> valid file descriptor with the standard I/O functions.
where "well-behaved" means that it
> does not directly access any fields in the FILE structure pointed
> to by the FILE pointer associated with any standard I/O stream,
And the stdio/flush.c implementation reads:
/*
* if this is not an internal extended FILE then check
* if _file is being changed from underneath us.
* It should not be because if
* it is then then we lose our ability to guard against
* silent data corruption.
*/
if (!iop->__xf_nocheck && bad_fd > -1 && iop->_magic != bad_fd) {
(void) fprintf(stderr,
"Application violated extended FILE safety mechanism.\n"
"Please read the man page for extendedFILE.\nAborting\n");
abort();
}
This appears to be an insane workaround for broken implementation with
exposed FILE internals and _file being an u8, both only on non-LP64;
it's shimmed out on all LP64 targets in Illumos,
and we shim it out as well: just get rid of it
This appears to've been originally fixed in illumos-gate
a5f69788de7ac07553de47f7fec8c05a9a94c105 ("PSARC 2006/162 Extended FILE
space for 32-bit Solaris processes", "1085341 32-bit stdio routines
should support file descriptors >255"), which also bears extendedFILE
and enable_extended_FILE_stdio(3C):
- unsigned char _file; /* UNIX System file descriptor */
+ unsigned char _magic; /* Old home of the file descriptor */
+ /* Only fileno(3C) can retrieve the
value now */
and
+/*
+ * Macros to aid the extended fd FILE work.
+ * This helps isolate the changes to only the 32-bit code
+ * since 64-bit Solaris is not affected by this.
+ */
+#ifdef _LP64
+#define GET_FD(iop) ((iop)->_file)
+#define SET_FILE(iop, fd) ((iop)->_file = (fd))
+#else
+#define GET_FD(iop) \
+ (((iop)->__extendedfd) ? _file_get(iop) : (iop)->_magic)
+#define SET_FILE(iop, fd) (iop)->_magic = (fd); (iop)->__extendedfd = 0
+#endif
Also remove the 1k setrlimit(NOFILE) calls: that's the default on Linux,
with 64k on Illumos and 171k on FreeBSD
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13411
2022-05-03 15:52:53 +03:00
|
|
|
(void) execl(cmd, cmd, (char *)NULL);
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_dump_core = B_FALSE;
|
|
|
|
fatal(B_TRUE, "exec failed: %s", cmd);
|
|
|
|
}
|
|
|
|
|
2012-10-04 22:14:04 +04:00
|
|
|
if (cmdbuf != NULL) {
|
|
|
|
umem_free(cmdbuf, MAXPATHLEN);
|
|
|
|
cmd = NULL;
|
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
while (waitpid(pid, &status, 0) != pid)
|
|
|
|
continue;
|
|
|
|
if (statusp != NULL)
|
|
|
|
*statusp = status;
|
|
|
|
|
|
|
|
if (WIFEXITED(status)) {
|
|
|
|
if (WEXITSTATUS(status) != 0) {
|
|
|
|
(void) fprintf(stderr, "child exited with code %d\n",
|
|
|
|
WEXITSTATUS(status));
|
|
|
|
exit(2);
|
|
|
|
}
|
|
|
|
return (B_FALSE);
|
|
|
|
} else if (WIFSIGNALED(status)) {
|
|
|
|
if (!ignorekill || WTERMSIG(status) != SIGKILL) {
|
|
|
|
(void) fprintf(stderr, "child died with signal %d\n",
|
|
|
|
WTERMSIG(status));
|
|
|
|
exit(3);
|
|
|
|
}
|
|
|
|
return (B_TRUE);
|
|
|
|
} else {
|
|
|
|
(void) fprintf(stderr, "something strange happened to child\n");
|
|
|
|
exit(4);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
ztest_run_init(void)
|
|
|
|
{
|
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_t *zs = ztest_shared;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
/*
|
|
|
|
* Blow away any existing copy of zpool.cache
|
|
|
|
*/
|
|
|
|
(void) remove(spa_config_path);
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (ztest_opts.zo_init == 0) {
|
|
|
|
if (ztest_opts.zo_verbose >= 1)
|
|
|
|
(void) printf("Importing pool %s\n",
|
|
|
|
ztest_opts.zo_pool);
|
|
|
|
ztest_import(zs);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
/*
|
|
|
|
* Create and initialize our storage pool.
|
|
|
|
*/
|
|
|
|
for (i = 1; i <= ztest_opts.zo_init; i++) {
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(zs, 0, sizeof (*zs));
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 3 &&
|
|
|
|
ztest_opts.zo_init != 1) {
|
|
|
|
(void) printf("ztest_init(), pass %d\n", i);
|
|
|
|
}
|
|
|
|
ztest_init(zs);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
main(int argc, char **argv)
|
|
|
|
{
|
|
|
|
int kills = 0;
|
|
|
|
int iters = 0;
|
2012-01-24 06:43:32 +04:00
|
|
|
int older = 0;
|
|
|
|
int newer = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
ztest_shared_t *zs;
|
|
|
|
ztest_info_t *zi;
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_shared_callstate_t *zc;
|
2008-11-20 23:01:55 +03:00
|
|
|
char timebuf[100];
|
2017-06-13 12:16:45 +03:00
|
|
|
char numbuf[NN_NUMBUF_SZ];
|
2012-10-04 22:14:04 +04:00
|
|
|
char *cmd;
|
2012-01-24 06:43:32 +04:00
|
|
|
boolean_t hasalt;
|
2021-02-16 13:14:44 +03:00
|
|
|
int f, err;
|
2012-10-04 18:09:12 +04:00
|
|
|
char *fd_data_str = getenv("ZTEST_FD_DATA");
|
2014-10-11 05:05:54 +04:00
|
|
|
struct sigaction action;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) setvbuf(stdout, NULL, _IOLBF, 0);
|
|
|
|
|
2013-05-17 01:18:06 +04:00
|
|
|
dprintf_setup(&argc, argv);
|
2017-12-19 01:06:07 +03:00
|
|
|
zfs_deadman_synctime_ms = 300000;
|
2018-10-10 23:48:33 +03:00
|
|
|
zfs_deadman_checktime_ms = 30000;
|
2017-08-04 19:30:49 +03:00
|
|
|
/*
|
|
|
|
* As two-word space map entries may not come up often (especially
|
|
|
|
* if pool and vdev sizes are small) we want to force at least some
|
|
|
|
* of them so the feature get tested.
|
|
|
|
*/
|
|
|
|
zfs_force_some_double_word_sm_entries = B_TRUE;
|
2013-05-17 01:18:06 +04:00
|
|
|
|
2018-10-01 20:36:34 +03:00
|
|
|
/*
|
|
|
|
* Verify that even extensively damaged split blocks with many
|
|
|
|
* segments can be reconstructed in a reasonable amount of time
|
|
|
|
* when reconstruction is known to be possible.
|
2018-10-18 23:53:27 +03:00
|
|
|
*
|
|
|
|
* Note: the lower this value is, the more damage we inflict, and
|
|
|
|
* the more time ztest spends in recovering that damage. We chose
|
|
|
|
* to induce damage 1/100th of the time so recovery is tested but
|
|
|
|
* not so frequently that ztest doesn't get to test other code paths.
|
2018-10-01 20:36:34 +03:00
|
|
|
*/
|
2018-10-18 23:53:27 +03:00
|
|
|
zfs_reconstruct_indirect_damage_fraction = 100;
|
2018-10-01 20:36:34 +03:00
|
|
|
|
2014-10-11 05:05:54 +04:00
|
|
|
action.sa_handler = sig_handler;
|
|
|
|
sigemptyset(&action.sa_mask);
|
|
|
|
action.sa_flags = 0;
|
|
|
|
|
|
|
|
if (sigaction(SIGSEGV, &action, NULL) < 0) {
|
|
|
|
(void) fprintf(stderr, "ztest: cannot catch SIGSEGV: %s.\n",
|
|
|
|
strerror(errno));
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (sigaction(SIGABRT, &action, NULL) < 0) {
|
|
|
|
(void) fprintf(stderr, "ztest: cannot catch SIGABRT: %s.\n",
|
|
|
|
strerror(errno));
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
|
|
|
|
2018-01-12 20:36:26 +03:00
|
|
|
/*
|
|
|
|
* Force random_get_bytes() to use /dev/urandom in order to prevent
|
|
|
|
* ztest from needlessly depleting the system entropy pool.
|
|
|
|
*/
|
|
|
|
random_path = "/dev/urandom";
|
2022-05-03 17:58:40 +03:00
|
|
|
ztest_fd_rand = open(random_path, O_RDONLY | O_CLOEXEC);
|
2012-10-04 18:09:12 +04:00
|
|
|
ASSERT3S(ztest_fd_rand, >=, 0);
|
|
|
|
|
|
|
|
if (!fd_data_str) {
|
2012-01-24 06:43:32 +04:00
|
|
|
process_options(argc, argv);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-10-04 18:09:12 +04:00
|
|
|
setup_data_fd();
|
2012-01-24 06:43:32 +04:00
|
|
|
setup_hdr();
|
|
|
|
setup_data();
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(ztest_shared_opts, &ztest_opts,
|
2012-01-24 06:43:32 +04:00
|
|
|
sizeof (*ztest_shared_opts));
|
|
|
|
} else {
|
2012-10-04 18:09:12 +04:00
|
|
|
ztest_fd_data = atoi(fd_data_str);
|
2012-01-24 06:43:32 +04:00
|
|
|
setup_data();
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(&ztest_opts, ztest_shared_opts, sizeof (ztest_opts));
|
2012-01-24 06:43:32 +04:00
|
|
|
}
|
|
|
|
ASSERT3U(ztest_opts.zo_datasets, ==, ztest_shared_hdr->zh_ds_count);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-02-16 13:14:44 +03:00
|
|
|
err = ztest_set_global_vars();
|
|
|
|
if (err != 0 && !fd_data_str) {
|
|
|
|
/* error message done by ztest_set_global_vars */
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
} else {
|
|
|
|
/* children should not be spawned if setting gvars fails */
|
|
|
|
VERIFY3S(err, ==, 0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* Override location of zpool.cache */
|
2021-01-13 04:21:01 +03:00
|
|
|
VERIFY3S(asprintf((char **)&spa_config_path, "%s/zpool.cache",
|
|
|
|
ztest_opts.zo_dir), !=, -1);
|
2012-10-04 18:09:12 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_ds = umem_alloc(ztest_opts.zo_datasets * sizeof (ztest_ds_t),
|
|
|
|
UMEM_NOFAIL);
|
|
|
|
zs = ztest_shared;
|
|
|
|
|
2012-10-04 18:09:12 +04:00
|
|
|
if (fd_data_str) {
|
2017-07-24 21:07:39 +03:00
|
|
|
metaslab_force_ganging = ztest_opts.zo_metaslab_force_ganging;
|
2012-01-24 06:43:32 +04:00
|
|
|
metaslab_df_alloc_threshold =
|
|
|
|
zs->zs_metaslab_df_alloc_threshold;
|
|
|
|
|
|
|
|
if (zs->zs_do_init)
|
|
|
|
ztest_run_init();
|
|
|
|
else
|
|
|
|
ztest_run(zs);
|
|
|
|
exit(0);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
hasalt = (strlen(ztest_opts.zo_alt_ztest) != 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
(void) printf("%"PRIu64" vdevs, %d datasets, %d threads, "
|
|
|
|
"%d %s disks, parity %d, %"PRIu64" seconds...\n\n",
|
2021-06-05 15:18:37 +03:00
|
|
|
ztest_opts.zo_vdevs,
|
2012-01-24 06:43:32 +04:00
|
|
|
ztest_opts.zo_datasets,
|
|
|
|
ztest_opts.zo_threads,
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
ztest_opts.zo_raid_children,
|
|
|
|
ztest_opts.zo_raid_type,
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
ztest_opts.zo_raid_parity,
|
2021-06-05 15:18:37 +03:00
|
|
|
ztest_opts.zo_time);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2012-10-04 22:14:04 +04:00
|
|
|
cmd = umem_alloc(MAXNAMELEN, UMEM_NOFAIL);
|
|
|
|
(void) strlcpy(cmd, getexecname(), MAXNAMELEN);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
|
|
|
zs->zs_do_init = B_TRUE;
|
|
|
|
if (strlen(ztest_opts.zo_alt_ztest) != 0) {
|
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("Executing older ztest for "
|
|
|
|
"initialization: %s\n", ztest_opts.zo_alt_ztest);
|
|
|
|
}
|
|
|
|
VERIFY(!exec_child(ztest_opts.zo_alt_ztest,
|
|
|
|
ztest_opts.zo_alt_libpath, B_FALSE, NULL));
|
|
|
|
} else {
|
|
|
|
VERIFY(!exec_child(NULL, NULL, B_FALSE, NULL));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_do_init = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
zs->zs_proc_start = gethrtime();
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_proc_stop = zs->zs_proc_start + ztest_opts.zo_time * NANOSEC;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (f = 0; f < ZTEST_FUNCS; f++) {
|
2012-01-24 06:43:32 +04:00
|
|
|
zi = &ztest_info[f];
|
|
|
|
zc = ZTEST_GET_SHARED_CALLSTATE(f);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zs->zs_proc_start + zi->zi_interval[0] > zs->zs_proc_stop)
|
2012-01-24 06:43:32 +04:00
|
|
|
zc->zc_next = UINT64_MAX;
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
2012-01-24 06:43:32 +04:00
|
|
|
zc->zc_next = zs->zs_proc_start +
|
2010-05-29 00:45:14 +04:00
|
|
|
ztest_random(2 * zi->zi_interval[0] + 1);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Run the tests in a loop. These tests include fault injection
|
|
|
|
* to verify that self-healing data works, and forced crashes
|
|
|
|
* to verify that we never lose on-disk consistency.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
while (gethrtime() < zs->zs_proc_stop) {
|
2008-11-20 23:01:55 +03:00
|
|
|
int status;
|
2012-01-24 06:43:32 +04:00
|
|
|
boolean_t killed;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize the workload counters for each function.
|
|
|
|
*/
|
2010-08-26 20:52:39 +04:00
|
|
|
for (f = 0; f < ZTEST_FUNCS; f++) {
|
2012-01-24 06:43:32 +04:00
|
|
|
zc = ZTEST_GET_SHARED_CALLSTATE(f);
|
|
|
|
zc->zc_count = 0;
|
|
|
|
zc->zc_time = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/* Set the allocation switch size */
|
2012-01-24 06:43:32 +04:00
|
|
|
zs->zs_metaslab_df_alloc_threshold =
|
|
|
|
ztest_random(zs->zs_metaslab_sz / 4) + 1;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (!hasalt || ztest_random(2) == 0) {
|
|
|
|
if (hasalt && ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("Executing newer ztest: %s\n",
|
|
|
|
cmd);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2012-01-24 06:43:32 +04:00
|
|
|
newer++;
|
|
|
|
killed = exec_child(cmd, NULL, B_TRUE, &status);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2012-01-24 06:43:32 +04:00
|
|
|
if (hasalt && ztest_opts.zo_verbose >= 1) {
|
|
|
|
(void) printf("Executing older ztest: %s\n",
|
|
|
|
ztest_opts.zo_alt_ztest);
|
|
|
|
}
|
|
|
|
older++;
|
|
|
|
killed = exec_child(ztest_opts.zo_alt_ztest,
|
|
|
|
ztest_opts.zo_alt_libpath, B_TRUE, &status);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (killed)
|
|
|
|
kills++;
|
2008-11-20 23:01:55 +03:00
|
|
|
iters++;
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
2008-11-20 23:01:55 +03:00
|
|
|
hrtime_t now = gethrtime();
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
now = MIN(now, zs->zs_proc_stop);
|
|
|
|
print_time(zs->zs_proc_stop - now, timebuf);
|
2017-06-13 12:16:45 +03:00
|
|
|
nicenum(zs->zs_space, numbuf, sizeof (numbuf));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) printf("Pass %3d, %8s, %3"PRIu64" ENOSPC, "
|
2008-11-20 23:01:55 +03:00
|
|
|
"%4.1f%% of %5s used, %3.0f%% done, %8s to go\n",
|
|
|
|
iters,
|
|
|
|
WIFEXITED(status) ? "Complete" : "SIGKILL",
|
2021-06-05 15:18:37 +03:00
|
|
|
zs->zs_enospc_count,
|
2008-11-20 23:01:55 +03:00
|
|
|
100.0 * zs->zs_alloc / zs->zs_space,
|
|
|
|
numbuf,
|
2010-05-29 00:45:14 +04:00
|
|
|
100.0 * (now - zs->zs_proc_start) /
|
2012-01-24 06:43:32 +04:00
|
|
|
(ztest_opts.zo_time * NANOSEC), timebuf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 2) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\nWorkload summary:\n\n");
|
|
|
|
(void) printf("%7s %9s %s\n",
|
|
|
|
"Calls", "Time", "Function");
|
|
|
|
(void) printf("%7s %9s %s\n",
|
|
|
|
"-----", "----", "--------");
|
2010-08-26 20:52:39 +04:00
|
|
|
for (f = 0; f < ZTEST_FUNCS; f++) {
|
2012-01-24 06:43:32 +04:00
|
|
|
zi = &ztest_info[f];
|
|
|
|
zc = ZTEST_GET_SHARED_CALLSTATE(f);
|
|
|
|
print_time(zc->zc_time, timebuf);
|
2021-06-05 15:18:37 +03:00
|
|
|
(void) printf("%7"PRIu64" %9s %s\n",
|
|
|
|
zc->zc_count, timebuf,
|
2015-02-24 21:53:31 +03:00
|
|
|
zi->zi_funcname);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (!ztest_opts.zo_mmp_test)
|
2023-09-23 02:08:51 +03:00
|
|
|
ztest_run_zdb(zs->zs_guid);
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
if (ztest_shared_opts->zo_raidz_expand_test ==
|
|
|
|
RAIDZ_EXPAND_CHECKED)
|
|
|
|
break; /* raidz expand test complete */
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (ztest_opts.zo_verbose >= 1) {
|
|
|
|
if (hasalt) {
|
|
|
|
(void) printf("%d runs of older ztest: %s\n", older,
|
|
|
|
ztest_opts.zo_alt_ztest);
|
|
|
|
(void) printf("%d runs of newer ztest: %s\n", newer,
|
|
|
|
cmd);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("%d killed, %d completed, %.0f%% kill rate\n",
|
|
|
|
kills, iters - kills, (100.0 * kills) / MAX(1, iters));
|
|
|
|
}
|
|
|
|
|
2012-10-04 22:14:04 +04:00
|
|
|
umem_free(cmd, MAXNAMELEN);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|