2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* CDDL HEADER START
|
|
|
|
*
|
|
|
|
* The contents of this file are subject to the terms of the
|
|
|
|
* Common Development and Distribution License (the "License").
|
|
|
|
* You may not use this file except in compliance with the License.
|
|
|
|
*
|
|
|
|
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
|
2022-07-12 00:16:13 +03:00
|
|
|
* or https://opensource.org/licenses/CDDL-1.0.
|
2008-11-20 23:01:55 +03:00
|
|
|
* See the License for the specific language governing permissions
|
|
|
|
* and limitations under the License.
|
|
|
|
*
|
|
|
|
* When distributing Covered Code, include this CDDL HEADER in each
|
|
|
|
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
|
|
|
|
* If applicable, add the following below this CDDL HEADER, with the
|
|
|
|
* fields enclosed by brackets "[]" replaced with your own identifying
|
|
|
|
* information: Portions Copyright [yyyy] [name of copyright owner]
|
|
|
|
*
|
|
|
|
* CDDL HEADER END
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
|
2020-09-04 20:34:28 +03:00
|
|
|
* Copyright (c) 2011, 2020 by Delphix. All rights reserved.
|
2018-09-13 04:14:42 +03:00
|
|
|
* Copyright (c) 2018, Nexenta Systems, Inc. All rights reserved.
|
2015-04-02 06:44:32 +03:00
|
|
|
* Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
|
2016-06-16 01:47:05 +03:00
|
|
|
* Copyright 2013 Saso Kiselkov. All rights reserved.
|
2017-04-13 19:40:00 +03:00
|
|
|
* Copyright (c) 2014 Integros [integros.com]
|
|
|
|
* Copyright 2016 Toomas Soome <tsoome@me.com>
|
2014-03-22 13:07:14 +04:00
|
|
|
* Copyright (c) 2016 Actifio, Inc. All rights reserved.
|
2018-08-30 16:13:30 +03:00
|
|
|
* Copyright 2018 Joyent, Inc.
|
2019-11-27 21:15:01 +03:00
|
|
|
* Copyright (c) 2017, 2019, Datto Inc. All rights reserved.
|
2017-05-30 21:39:17 +03:00
|
|
|
* Copyright 2017 Joyent, Inc.
|
2018-09-06 04:33:36 +03:00
|
|
|
* Copyright (c) 2017, Intel Corporation.
|
2021-02-18 08:30:45 +03:00
|
|
|
* Copyright (c) 2021, Colm Buckley <colm@tuatha.org>
|
2011-11-08 04:26:52 +04:00
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2013-06-11 21:12:34 +04:00
|
|
|
* SPA: Storage Pool Allocator
|
|
|
|
*
|
2008-11-20 23:01:55 +03:00
|
|
|
* This file contains all the routines used when modifying on-disk SPA state.
|
|
|
|
* This includes opening, importing, destroying, exporting a pool, and syncing a
|
|
|
|
* pool.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <sys/zfs_context.h>
|
|
|
|
#include <sys/fm/fs/zfs.h>
|
|
|
|
#include <sys/spa_impl.h>
|
|
|
|
#include <sys/zio.h>
|
|
|
|
#include <sys/zio_checksum.h>
|
|
|
|
#include <sys/dmu.h>
|
|
|
|
#include <sys/dmu_tx.h>
|
|
|
|
#include <sys/zap.h>
|
|
|
|
#include <sys/zil.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/ddt.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/vdev_impl.h>
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
#include <sys/vdev_removal.h>
|
|
|
|
#include <sys/vdev_indirect_mapping.h>
|
|
|
|
#include <sys/vdev_indirect_births.h>
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
#include <sys/vdev_initialize.h>
|
2020-07-03 21:05:50 +03:00
|
|
|
#include <sys/vdev_rebuild.h>
|
2019-03-29 19:13:20 +03:00
|
|
|
#include <sys/vdev_trim.h>
|
2010-08-26 22:49:16 +04:00
|
|
|
#include <sys/vdev_disk.h>
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
#include <sys/vdev_draid.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/metaslab.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/metaslab_impl.h>
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
#include <sys/mmp.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/uberblock_impl.h>
|
|
|
|
#include <sys/txg.h>
|
|
|
|
#include <sys/avl.h>
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
#include <sys/bpobj.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/dmu_traverse.h>
|
|
|
|
#include <sys/dmu_objset.h>
|
|
|
|
#include <sys/unique.h>
|
|
|
|
#include <sys/dsl_pool.h>
|
|
|
|
#include <sys/dsl_dataset.h>
|
|
|
|
#include <sys/dsl_dir.h>
|
|
|
|
#include <sys/dsl_prop.h>
|
|
|
|
#include <sys/dsl_synctask.h>
|
|
|
|
#include <sys/fs/zfs.h>
|
|
|
|
#include <sys/arc.h>
|
|
|
|
#include <sys/callb.h>
|
|
|
|
#include <sys/systeminfo.h>
|
2009-07-03 02:44:48 +04:00
|
|
|
#include <sys/zfs_ioctl.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/dsl_scan.h>
|
2012-12-14 03:24:15 +04:00
|
|
|
#include <sys/zfeature.h>
|
2013-09-04 16:00:57 +04:00
|
|
|
#include <sys/dsl_destroy.h>
|
2013-05-28 15:50:38 +04:00
|
|
|
#include <sys/zvol.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
#ifdef _KERNEL
|
2017-05-30 21:39:17 +03:00
|
|
|
#include <sys/fm/protocol.h>
|
|
|
|
#include <sys/fm/util.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/callb.h>
|
2009-02-18 23:51:31 +03:00
|
|
|
#include <sys/zone.h>
|
2019-08-13 17:11:57 +03:00
|
|
|
#include <sys/vmsystm.h>
|
2009-02-18 23:51:31 +03:00
|
|
|
#endif /* _KERNEL */
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#include "zfs_prop.h"
|
|
|
|
#include "zfs_comutil.h"
|
|
|
|
|
2015-12-31 19:38:59 +03:00
|
|
|
/*
|
|
|
|
* The interval, in seconds, at which failed configuration cache file writes
|
|
|
|
* should be retried.
|
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
int zfs_ccw_retry_interval = 300;
|
2015-12-31 19:38:59 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
typedef enum zti_modes {
|
2013-05-06 23:24:30 +04:00
|
|
|
ZTI_MODE_FIXED, /* value is # of threads (min 1) */
|
|
|
|
ZTI_MODE_BATCH, /* cpu-intensive; value is ignored */
|
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 19:13:53 +03:00
|
|
|
ZTI_MODE_SCALE, /* Taskqs scale with CPUs. */
|
2013-05-06 23:24:30 +04:00
|
|
|
ZTI_MODE_NULL, /* don't create a taskq */
|
|
|
|
ZTI_NMODES
|
2010-05-29 00:45:14 +04:00
|
|
|
} zti_modes_t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-05-06 23:24:30 +04:00
|
|
|
#define ZTI_P(n, q) { ZTI_MODE_FIXED, (n), (q) }
|
|
|
|
#define ZTI_PCT(n) { ZTI_MODE_ONLINE_PERCENT, (n), 1 }
|
|
|
|
#define ZTI_BATCH { ZTI_MODE_BATCH, 0, 1 }
|
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 19:13:53 +03:00
|
|
|
#define ZTI_SCALE { ZTI_MODE_SCALE, 0, 1 }
|
2013-05-06 23:24:30 +04:00
|
|
|
#define ZTI_NULL { ZTI_MODE_NULL, 0, 0 }
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2013-05-06 23:24:30 +04:00
|
|
|
#define ZTI_N(n) ZTI_P(n, 1)
|
|
|
|
#define ZTI_ONE ZTI_N(1)
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
typedef struct zio_taskq_info {
|
2013-05-06 23:24:30 +04:00
|
|
|
zti_modes_t zti_mode;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint_t zti_value;
|
2013-05-06 23:24:30 +04:00
|
|
|
uint_t zti_count;
|
2009-07-03 02:44:48 +04:00
|
|
|
} zio_taskq_info_t;
|
|
|
|
|
|
|
|
static const char *const zio_taskq_types[ZIO_TASKQ_TYPES] = {
|
2010-10-28 21:36:50 +04:00
|
|
|
"iss", "iss_h", "int", "int_h"
|
2009-07-03 02:44:48 +04:00
|
|
|
};
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2013-05-06 23:24:30 +04:00
|
|
|
* This table defines the taskq settings for each ZFS I/O type. When
|
|
|
|
* initializing a pool, we use this table to create an appropriately sized
|
|
|
|
* taskq. Some operations are low volume and therefore have a small, static
|
|
|
|
* number of threads assigned to their taskqs using the ZTI_N(#) or ZTI_ONE
|
|
|
|
* macros. Other operations process a large amount of data; the ZTI_BATCH
|
|
|
|
* macro causes us to create a taskq oriented for throughput. Some operations
|
2019-03-29 19:13:20 +03:00
|
|
|
* are so high frequency and short-lived that the taskq itself can become a
|
2013-05-06 23:24:30 +04:00
|
|
|
* point of lock contention. The ZTI_P(#, #) macro indicates that we need an
|
|
|
|
* additional degree of parallelism specified by the number of threads per-
|
|
|
|
* taskq and the number of taskqs; when dispatching an event in this case, the
|
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 19:13:53 +03:00
|
|
|
* particular taskq is chosen at random. ZTI_SCALE is similar to ZTI_BATCH,
|
|
|
|
* but with number of taskqs also scaling with number of CPUs.
|
2013-05-06 23:24:30 +04:00
|
|
|
*
|
|
|
|
* The different taskq priorities are to handle the different contexts (issue
|
|
|
|
* and interrupt) and then to reserve threads for ZIO_PRIORITY_NOW I/Os that
|
|
|
|
* need to be handled with minimum delay.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ISSUE ISSUE_HIGH INTR INTR_HIGH */
|
2013-05-06 23:24:30 +04:00
|
|
|
{ ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* NULL */
|
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 19:13:53 +03:00
|
|
|
{ ZTI_N(8), ZTI_NULL, ZTI_SCALE, ZTI_NULL }, /* READ */
|
|
|
|
{ ZTI_BATCH, ZTI_N(5), ZTI_SCALE, ZTI_N(5) }, /* WRITE */
|
|
|
|
{ ZTI_SCALE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* FREE */
|
2013-05-06 23:24:30 +04:00
|
|
|
{ ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* CLAIM */
|
|
|
|
{ ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* IOCTL */
|
2019-03-29 19:13:20 +03:00
|
|
|
{ ZTI_N(4), ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* TRIM */
|
2009-07-03 02:44:48 +04:00
|
|
|
};
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
static void spa_sync_version(void *arg, dmu_tx_t *tx);
|
|
|
|
static void spa_sync_props(void *arg, dmu_tx_t *tx);
|
2008-12-03 23:09:06 +03:00
|
|
|
static boolean_t spa_has_active_shared_spare(spa_t *spa);
|
2022-04-19 21:38:30 +03:00
|
|
|
static int spa_load_impl(spa_t *spa, spa_import_type_t type,
|
|
|
|
const char **ereport);
|
2010-08-27 01:24:34 +04:00
|
|
|
static void spa_vdev_resilver_done(spa_t *spa);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2022-01-15 02:37:55 +03:00
|
|
|
static uint_t zio_taskq_batch_pct = 80; /* 1 thread per cpu in pset */
|
|
|
|
static uint_t zio_taskq_batch_tpq; /* threads per taskq */
|
|
|
|
static const boolean_t zio_taskq_sysdc = B_TRUE; /* use SDC scheduling class */
|
|
|
|
static const uint_t zio_taskq_basedc = 80; /* base duty cycle */
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2022-01-15 02:37:55 +03:00
|
|
|
static const boolean_t spa_create_process = B_TRUE; /* no process => no sysdc */
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-01-31 02:25:19 +03:00
|
|
|
/*
|
|
|
|
* Report any spa_load_verify errors found, but do not fail spa_load.
|
|
|
|
* This is used by zdb to analyze non-idle pools.
|
|
|
|
*/
|
|
|
|
boolean_t spa_load_verify_dryrun = B_FALSE;
|
|
|
|
|
2021-11-09 23:50:39 +03:00
|
|
|
/*
|
|
|
|
* Allow read spacemaps in case of readonly import (spa_mode == SPA_MODE_READ).
|
|
|
|
* This is used by zdb for spacemaps verification.
|
|
|
|
*/
|
|
|
|
boolean_t spa_mode_readable_spacemaps = B_FALSE;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* This (illegal) pool name is used when temporarily importing a spa_t in order
|
|
|
|
* to get the vdev stats associated with the imported devices.
|
|
|
|
*/
|
|
|
|
#define TRYIMPORT_NAME "$import"
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
/*
|
|
|
|
* For debugging purposes: print out vdev tree during pool import.
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static int spa_load_print_vdev_tree = B_FALSE;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* A non-zero value for zfs_max_missing_tvds means that we allow importing
|
|
|
|
* pools with missing top-level vdevs. This is strictly intended for advanced
|
|
|
|
* pool recovery cases since missing data is almost inevitable. Pools with
|
|
|
|
* missing devices can only be imported read-only for safety reasons, and their
|
|
|
|
* fail-mode will be automatically set to "continue".
|
|
|
|
*
|
|
|
|
* With 1 missing vdev we should be able to import the pool and mount all
|
|
|
|
* datasets. User data that was not modified after the missing device has been
|
|
|
|
* added should be recoverable. This means that snapshots created prior to the
|
|
|
|
* addition of that device should be completely intact.
|
|
|
|
*
|
|
|
|
* With 2 missing vdevs, some datasets may fail to mount since there are
|
|
|
|
* dataset statistics that are stored as regular metadata. Some data might be
|
|
|
|
* recoverable if those vdevs were added recently.
|
|
|
|
*
|
|
|
|
* With 3 or more missing vdevs, the pool is severely damaged and MOS entries
|
|
|
|
* may be missing entirely. Chances of data recovery are very low. Note that
|
|
|
|
* there are also risks of performing an inadvertent rewind as we might be
|
|
|
|
* missing all the vdevs with the latest uberblocks.
|
|
|
|
*/
|
|
|
|
unsigned long zfs_max_missing_tvds = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The parameters below are similar to zfs_max_missing_tvds but are only
|
|
|
|
* intended for a preliminary open of the pool with an untrusted config which
|
|
|
|
* might be incomplete or out-dated.
|
|
|
|
*
|
|
|
|
* We are more tolerant for pools opened from a cachefile since we could have
|
|
|
|
* an out-dated cachefile where a device removal was not registered.
|
|
|
|
* We could have set the limit arbitrarily high but in the case where devices
|
|
|
|
* are really missing we would want to return the proper error codes; we chose
|
|
|
|
* SPA_DVAS_PER_BP - 1 so that some copies of the MOS would still be available
|
|
|
|
* and we get a chance to retrieve the trusted config.
|
|
|
|
*/
|
|
|
|
uint64_t zfs_max_missing_tvds_cachefile = SPA_DVAS_PER_BP - 1;
|
2016-12-17 01:11:29 +03:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
/*
|
|
|
|
* In the case where config was assembled by scanning device paths (/dev/dsks
|
|
|
|
* by default) we are less tolerant since all the existing devices should have
|
|
|
|
* been detected and we want spa_load to return the right error codes.
|
|
|
|
*/
|
|
|
|
uint64_t zfs_max_missing_tvds_scan = 0;
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
/*
|
|
|
|
* Debugging aid that pauses spa_sync() towards the end.
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static const boolean_t zfs_pause_spa_sync = B_FALSE;
|
2016-12-17 01:11:29 +03:00
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
/*
|
|
|
|
* Variables to indicate the livelist condense zthr func should wait at certain
|
|
|
|
* points for the livelist to be removed - used to test condense/destroy races
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static int zfs_livelist_condense_zthr_pause = 0;
|
|
|
|
static int zfs_livelist_condense_sync_pause = 0;
|
2019-07-26 20:54:14 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Variables to track whether or not condense cancellation has been
|
|
|
|
* triggered in testing.
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static int zfs_livelist_condense_sync_cancel = 0;
|
|
|
|
static int zfs_livelist_condense_zthr_cancel = 0;
|
2019-07-26 20:54:14 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Variable to track whether or not extra ALLOC blkptrs were added to a
|
|
|
|
* livelist entry while it was being condensed (caused by the way we track
|
|
|
|
* remapped blkptrs in dbuf_remap_impl)
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static int zfs_livelist_condense_new_alloc = 0;
|
2019-07-26 20:54:14 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* SPA properties routines
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add a (source=src, propname=propval) list to an nvlist.
|
|
|
|
*/
|
|
|
|
static void
|
2022-04-19 21:38:30 +03:00
|
|
|
spa_prop_add_list(nvlist_t *nvl, zpool_prop_t prop, const char *strval,
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t intval, zprop_source_t src)
|
|
|
|
{
|
|
|
|
const char *propname = zpool_prop_to_name(prop);
|
|
|
|
nvlist_t *propval;
|
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
propval = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64(propval, ZPROP_SOURCE, src);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (strval != NULL)
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_string(propval, ZPROP_VALUE, strval);
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_uint64(propval, ZPROP_VALUE, intval);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_nvlist(nvl, propname, propval);
|
2008-11-20 23:01:55 +03:00
|
|
|
nvlist_free(propval);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get property values from the spa configuration.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_prop_get_config(spa_t *spa, nvlist_t **nvp)
|
|
|
|
{
|
2012-01-24 06:43:32 +04:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2012-12-14 03:24:15 +04:00
|
|
|
dsl_pool_t *pool = spa->spa_dsl_pool;
|
2014-07-20 00:19:24 +04:00
|
|
|
uint64_t size, alloc, cap, version;
|
2016-01-06 00:46:54 +03:00
|
|
|
const zprop_source_t src = ZPROP_SRC_NONE;
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_dirent_t *dp;
|
2014-07-20 00:19:24 +04:00
|
|
|
metaslab_class_t *mc = spa_normal_class(spa);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&spa->spa_props_lock));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-01-24 06:43:32 +04:00
|
|
|
if (rvd != NULL) {
|
2018-09-06 04:33:36 +03:00
|
|
|
alloc = metaslab_class_get_alloc(mc);
|
|
|
|
alloc += metaslab_class_get_alloc(spa_special_class(spa));
|
|
|
|
alloc += metaslab_class_get_alloc(spa_dedup_class(spa));
|
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-22 02:12:54 +03:00
|
|
|
alloc += metaslab_class_get_alloc(spa_embedded_log_class(spa));
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
size = metaslab_class_get_space(mc);
|
|
|
|
size += metaslab_class_get_space(spa_special_class(spa));
|
|
|
|
size += metaslab_class_get_space(spa_dedup_class(spa));
|
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-22 02:12:54 +03:00
|
|
|
size += metaslab_class_get_space(spa_embedded_log_class(spa));
|
2018-09-06 04:33:36 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_NAME, spa_name(spa), 0, src);
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_SIZE, NULL, size, src);
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_ALLOCATED, NULL, alloc, src);
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_FREE, NULL,
|
|
|
|
size - alloc, src);
|
2016-12-17 01:11:29 +03:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_CHECKPOINT, NULL,
|
|
|
|
spa->spa_checkpoint_info.sci_dspace, src);
|
2012-01-24 06:43:32 +04:00
|
|
|
|
2014-07-20 00:19:24 +04:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_FRAGMENTATION, NULL,
|
|
|
|
metaslab_class_fragmentation(mc), src);
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_EXPANDSZ, NULL,
|
|
|
|
metaslab_class_expandable_space(mc), src);
|
2010-08-27 01:24:34 +04:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_READONLY, NULL,
|
2019-11-21 20:32:57 +03:00
|
|
|
(spa_mode(spa) == SPA_MODE_READ), src);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
cap = (size == 0) ? 0 : (alloc * 100 / size);
|
2009-02-18 23:51:31 +03:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_CAPACITY, NULL, cap, src);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPRATIO, NULL,
|
|
|
|
ddt_get_pool_dedup_ratio(spa), src);
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_HEALTH, NULL,
|
2012-01-24 06:43:32 +04:00
|
|
|
rvd->vdev_state, src);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
version = spa_version(spa);
|
2016-01-06 00:46:54 +03:00
|
|
|
if (version == zpool_prop_default_numeric(ZPOOL_PROP_VERSION)) {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_VERSION, NULL,
|
|
|
|
version, ZPROP_SRC_DEFAULT);
|
|
|
|
} else {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_VERSION, NULL,
|
|
|
|
version, ZPROP_SRC_LOCAL);
|
|
|
|
}
|
2018-08-20 19:52:37 +03:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_LOAD_GUID,
|
|
|
|
NULL, spa_load_guid(spa), src);
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (pool != NULL) {
|
|
|
|
/*
|
|
|
|
* The $FREE directory was introduced in SPA_VERSION_DEADLISTS,
|
|
|
|
* when opening pools before this version freedir will be NULL.
|
|
|
|
*/
|
2014-06-06 01:20:08 +04:00
|
|
|
if (pool->dp_free_dir != NULL) {
|
2012-12-14 03:24:15 +04:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING, NULL,
|
2015-04-01 18:14:34 +03:00
|
|
|
dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes,
|
|
|
|
src);
|
2012-12-14 03:24:15 +04:00
|
|
|
} else {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING,
|
|
|
|
NULL, 0, src);
|
|
|
|
}
|
2014-06-06 01:20:08 +04:00
|
|
|
|
|
|
|
if (pool->dp_leak_dir != NULL) {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED, NULL,
|
2015-04-01 18:14:34 +03:00
|
|
|
dsl_dir_phys(pool->dp_leak_dir)->dd_used_bytes,
|
|
|
|
src);
|
2014-06-06 01:20:08 +04:00
|
|
|
} else {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED,
|
|
|
|
NULL, 0, src);
|
|
|
|
}
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_GUID, NULL, spa_guid(spa), src);
|
|
|
|
|
2011-11-15 23:01:27 +04:00
|
|
|
if (spa->spa_comment != NULL) {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_COMMENT, spa->spa_comment,
|
|
|
|
0, ZPROP_SRC_LOCAL);
|
|
|
|
}
|
|
|
|
|
2021-02-18 08:30:45 +03:00
|
|
|
if (spa->spa_compatibility != NULL) {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_COMPATIBILITY,
|
|
|
|
spa->spa_compatibility, 0, ZPROP_SRC_LOCAL);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (spa->spa_root != NULL)
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_ALTROOT, spa->spa_root,
|
|
|
|
0, ZPROP_SRC_LOCAL);
|
|
|
|
|
2014-11-03 23:15:08 +03:00
|
|
|
if (spa_feature_is_enabled(spa, SPA_FEATURE_LARGE_BLOCKS)) {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_MAXBLOCKSIZE, NULL,
|
|
|
|
MIN(zfs_max_recordsize, SPA_MAXBLOCKSIZE), ZPROP_SRC_NONE);
|
|
|
|
} else {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_MAXBLOCKSIZE, NULL,
|
|
|
|
SPA_OLD_MAXBLOCKSIZE, ZPROP_SRC_NONE);
|
|
|
|
}
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
if (spa_feature_is_enabled(spa, SPA_FEATURE_LARGE_DNODE)) {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_MAXDNODESIZE, NULL,
|
|
|
|
DNODE_MAX_SIZE, ZPROP_SRC_NONE);
|
|
|
|
} else {
|
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_MAXDNODESIZE, NULL,
|
|
|
|
DNODE_MIN_SIZE, ZPROP_SRC_NONE);
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if ((dp = list_head(&spa->spa_config_list)) != NULL) {
|
|
|
|
if (dp->scd_path == NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_CACHEFILE,
|
2008-12-03 23:09:06 +03:00
|
|
|
"none", 0, ZPROP_SRC_LOCAL);
|
|
|
|
} else if (strcmp(dp->scd_path, spa_config_path) != 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_prop_add_list(*nvp, ZPOOL_PROP_CACHEFILE,
|
2008-12-03 23:09:06 +03:00
|
|
|
dp->scd_path, 0, ZPROP_SRC_LOCAL);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get zpool property values.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
spa_prop_get(spa_t *spa, nvlist_t **nvp)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *mos = spa->spa_meta_objset;
|
2008-11-20 23:01:55 +03:00
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t za;
|
2020-07-14 22:21:57 +03:00
|
|
|
dsl_pool_t *dp;
|
2008-11-20 23:01:55 +03:00
|
|
|
int err;
|
|
|
|
|
2014-11-21 03:09:39 +03:00
|
|
|
err = nvlist_alloc(nvp, NV_UNIQUE_NAME, KM_SLEEP);
|
2010-08-26 22:49:16 +04:00
|
|
|
if (err)
|
2013-11-01 23:26:11 +04:00
|
|
|
return (err);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-07-14 22:21:57 +03:00
|
|
|
dp = spa_get_dsl(spa);
|
|
|
|
dsl_pool_config_enter(dp, FTAG);
|
2008-12-03 23:09:06 +03:00
|
|
|
mutex_enter(&spa->spa_props_lock);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Get properties from the spa config.
|
|
|
|
*/
|
|
|
|
spa_prop_get_config(spa, nvp);
|
|
|
|
|
|
|
|
/* If no pool property object, no more prop to get. */
|
2020-07-14 22:21:57 +03:00
|
|
|
if (mos == NULL || spa->spa_pool_props_object == 0)
|
2010-08-26 22:49:16 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Get properties from the MOS pool property object.
|
|
|
|
*/
|
|
|
|
for (zap_cursor_init(&zc, mos, spa->spa_pool_props_object);
|
|
|
|
(err = zap_cursor_retrieve(&zc, &za)) == 0;
|
|
|
|
zap_cursor_advance(&zc)) {
|
|
|
|
uint64_t intval = 0;
|
|
|
|
char *strval = NULL;
|
|
|
|
zprop_source_t src = ZPROP_SRC_DEFAULT;
|
|
|
|
zpool_prop_t prop;
|
|
|
|
|
2018-01-19 20:22:37 +03:00
|
|
|
if ((prop = zpool_name_to_prop(za.za_name)) == ZPOOL_PROP_INVAL)
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
|
|
|
|
|
|
|
switch (za.za_integer_length) {
|
|
|
|
case 8:
|
|
|
|
/* integer property */
|
|
|
|
if (za.za_first_integer !=
|
|
|
|
zpool_prop_default_numeric(prop))
|
|
|
|
src = ZPROP_SRC_LOCAL;
|
|
|
|
|
|
|
|
if (prop == ZPOOL_PROP_BOOTFS) {
|
|
|
|
dsl_dataset_t *ds = NULL;
|
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
err = dsl_dataset_hold_obj(dp,
|
|
|
|
za.za_first_integer, FTAG, &ds);
|
2020-07-14 22:21:57 +03:00
|
|
|
if (err != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
strval = kmem_alloc(ZFS_MAX_DATASET_NAME_LEN,
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
dsl_dataset_name(ds, strval);
|
2008-12-03 23:09:06 +03:00
|
|
|
dsl_dataset_rele(ds, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
strval = NULL;
|
|
|
|
intval = za.za_first_integer;
|
|
|
|
}
|
|
|
|
|
|
|
|
spa_prop_add_list(*nvp, prop, strval, intval, src);
|
|
|
|
|
|
|
|
if (strval != NULL)
|
2016-06-16 00:28:36 +03:00
|
|
|
kmem_free(strval, ZFS_MAX_DATASET_NAME_LEN);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
break;
|
|
|
|
|
|
|
|
case 1:
|
|
|
|
/* string property */
|
2014-11-21 03:09:39 +03:00
|
|
|
strval = kmem_alloc(za.za_num_integers, KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
err = zap_lookup(mos, spa->spa_pool_props_object,
|
|
|
|
za.za_name, 1, za.za_num_integers, strval);
|
|
|
|
if (err) {
|
|
|
|
kmem_free(strval, za.za_num_integers);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
spa_prop_add_list(*nvp, prop, strval, 0, src);
|
|
|
|
kmem_free(strval, za.za_num_integers);
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
out:
|
2020-07-14 22:21:57 +03:00
|
|
|
mutex_exit(&spa->spa_props_lock);
|
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (err && err != ENOENT) {
|
|
|
|
nvlist_free(*nvp);
|
|
|
|
*nvp = NULL;
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Validate the given pool properties nvlist and modify the list
|
|
|
|
* for the property values to be set.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
spa_prop_validate(spa_t *spa, nvlist_t *props)
|
|
|
|
{
|
|
|
|
nvpair_t *elem;
|
|
|
|
int error = 0, reset_bootfs = 0;
|
2010-08-26 20:58:04 +04:00
|
|
|
uint64_t objnum = 0;
|
2012-12-14 03:24:15 +04:00
|
|
|
boolean_t has_feature = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
elem = NULL;
|
|
|
|
while ((elem = nvlist_next_nvpair(props, elem)) != NULL) {
|
|
|
|
uint64_t intval;
|
2012-12-14 03:24:15 +04:00
|
|
|
char *strval, *slash, *check, *fname;
|
|
|
|
const char *propname = nvpair_name(elem);
|
|
|
|
zpool_prop_t prop = zpool_name_to_prop(propname);
|
|
|
|
|
2018-01-19 20:22:37 +03:00
|
|
|
switch (prop) {
|
|
|
|
case ZPOOL_PROP_INVAL:
|
2012-12-14 03:24:15 +04:00
|
|
|
if (!zpool_prop_feature(propname)) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2012-12-14 03:24:15 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sanitize the input.
|
|
|
|
*/
|
|
|
|
if (nvpair_type(elem) != DATA_TYPE_UINT64) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2012-12-14 03:24:15 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (nvpair_value_uint64(elem, &intval) != 0) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2012-12-14 03:24:15 +04:00
|
|
|
break;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (intval != 0) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2012-12-14 03:24:15 +04:00
|
|
|
break;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
fname = strchr(propname, '@') + 1;
|
|
|
|
if (zfeature_lookup_name(fname, NULL) != 0) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2012-12-14 03:24:15 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
has_feature = B_TRUE;
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
case ZPOOL_PROP_VERSION:
|
|
|
|
error = nvpair_value_uint64(elem, &intval);
|
|
|
|
if (!error &&
|
2012-12-14 03:24:15 +04:00
|
|
|
(intval < spa_version(spa) ||
|
|
|
|
intval > SPA_VERSION_BEFORE_FEATURES ||
|
|
|
|
has_feature))
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
|
|
|
|
case ZPOOL_PROP_DELEGATION:
|
|
|
|
case ZPOOL_PROP_AUTOREPLACE:
|
2008-12-03 23:09:06 +03:00
|
|
|
case ZPOOL_PROP_LISTSNAPS:
|
2009-07-03 02:44:48 +04:00
|
|
|
case ZPOOL_PROP_AUTOEXPAND:
|
2019-03-29 19:13:20 +03:00
|
|
|
case ZPOOL_PROP_AUTOTRIM:
|
2008-11-20 23:01:55 +03:00
|
|
|
error = nvpair_value_uint64(elem, &intval);
|
|
|
|
if (!error && intval > 1)
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
case ZPOOL_PROP_MULTIHOST:
|
|
|
|
error = nvpair_value_uint64(elem, &intval);
|
|
|
|
if (!error && intval > 1)
|
|
|
|
error = SET_ERROR(EINVAL);
|
|
|
|
|
2019-09-10 23:42:30 +03:00
|
|
|
if (!error) {
|
|
|
|
uint32_t hostid = zone_get_hostid(NULL);
|
|
|
|
if (hostid)
|
|
|
|
spa->spa_hostid = hostid;
|
|
|
|
else
|
|
|
|
error = SET_ERROR(ENOTSUP);
|
|
|
|
}
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
|
|
|
|
break;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
case ZPOOL_PROP_BOOTFS:
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* If the pool version is less than SPA_VERSION_BOOTFS,
|
|
|
|
* or the pool is still being created (version == 0),
|
|
|
|
* the bootfs property cannot be set.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
if (spa_version(spa) < SPA_VERSION_BOOTFS) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(ENOTSUP);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Make sure the vdev config is bootable
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (!vdev_is_bootable(spa->spa_root_vdev)) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(ENOTSUP);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
reset_bootfs = 1;
|
|
|
|
|
|
|
|
error = nvpair_value_string(elem, &strval);
|
|
|
|
|
|
|
|
if (!error) {
|
2012-12-14 03:24:15 +04:00
|
|
|
objset_t *os;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (strval == NULL || strval[0] == '\0') {
|
|
|
|
objnum = zpool_prop_default_numeric(
|
|
|
|
ZPOOL_PROP_BOOTFS);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
error = dmu_objset_hold(strval, FTAG, &os);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
if (error != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
Remove GRUB restrictions
The GRUB restrictions are based around the pool's bootfs property.
Given the current situation where GRUB is not staying current with
OpenZFS pool features, having either a non-ZFS /boot or a separate
pool with limited features are pretty much the only long-term answers
for GRUB support. Only the second case matters in this context. For
the restrictions to be useful, the bootfs property would have to be set
on the boot pool, because that is where we need the restrictions, as
that is the pool that GRUB reads from. The documentation for bootfs
describes it as pointing to the root pool. That's also how it's used in
the initramfs. ZFS does not allow setting bootfs to point to a dataset
in another pool. (If it did, it'd be difficult-to-impossible to enforce
these restrictions cross-pool). Accordingly, bootfs is pretty much
useless for GRUB scenarios moving forward.
Even for users who have only one pool, the existing restrictions for
GRUB are incomplete. They don't prevent you from enabling the
unsupported checksums, for example. For that reason, I have ripped out
all the GRUB restrictions.
A little longer-term, I think extending the proposed features=portable
system to define a features=grub is a much more useful approach. The
user could set that on the boot pool at creation, and things would
Just Work.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8627
2020-08-18 09:12:39 +03:00
|
|
|
/* Must be ZPL. */
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dmu_objset_type(os) != DMU_OST_ZFS) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(ENOTSUP);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
objnum = dmu_objset_id(os);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_rele(os, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
break;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
case ZPOOL_PROP_FAILUREMODE:
|
|
|
|
error = nvpair_value_uint64(elem, &intval);
|
2016-11-10 04:35:26 +03:00
|
|
|
if (!error && intval > ZIO_FAILURE_MODE_PANIC)
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This is a special case which only occurs when
|
|
|
|
* the pool has completely failed. This allows
|
|
|
|
* the user to change the in-core failmode property
|
|
|
|
* without syncing it out to disk (I/Os might
|
|
|
|
* currently be blocked). We do this by returning
|
|
|
|
* EIO to the caller (spa_prop_set) to trick it
|
|
|
|
* into thinking we encountered a property validation
|
|
|
|
* error.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (!error && spa_suspended(spa)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_failmode = intval;
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EIO);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ZPOOL_PROP_CACHEFILE:
|
|
|
|
if ((error = nvpair_value_string(elem, &strval)) != 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (strval[0] == '\0')
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (strcmp(strval, "none") == 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (strval[0] != '/') {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
slash = strrchr(strval, '/');
|
|
|
|
ASSERT(slash != NULL);
|
|
|
|
|
|
|
|
if (slash[1] == '\0' || strcmp(slash, "/.") == 0 ||
|
|
|
|
strcmp(slash, "/..") == 0)
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2011-11-15 23:01:27 +04:00
|
|
|
case ZPOOL_PROP_COMMENT:
|
|
|
|
if ((error = nvpair_value_string(elem, &strval)) != 0)
|
|
|
|
break;
|
|
|
|
for (check = strval; *check != '\0'; check++) {
|
|
|
|
if (!isprint(*check)) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2011-11-15 23:01:27 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (strlen(strval) > ZPROP_MAX_COMMENT)
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(E2BIG);
|
2011-11-15 23:01:27 +04:00
|
|
|
break;
|
|
|
|
|
2010-08-26 20:52:41 +04:00
|
|
|
default:
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
Remove dedupditto functionality
If dedup is in use, the `dedupditto` property can be set, causing ZFS to
keep an extra copy of data that is referenced many times (>100x). The
idea was that this data is more important than other data and thus we
want to be really sure that it is not lost if the disk experiences a
small amount of random corruption.
ZFS (and system administrators) rely on the pool-level redundancy to
protect their data (e.g. mirroring or RAIDZ). Since the user/sysadmin
doesn't have control over what data will be offered extra redundancy by
dedupditto, this extra redundancy is not very useful. The bulk of the
data is still vulnerable to loss based on the pool-level redundancy.
For example, if particle strikes corrupt 0.1% of blocks, you will either
be saved by mirror/raidz, or you will be sad. This is true even if
dedupditto saved another 0.01% of blocks from being corrupted.
Therefore, the dedupditto functionality is rarely enabled (i.e. the
property is rarely set), and it fulfills its promise of increased
redundancy even more rarely.
Additionally, this feature does not work as advertised (on existing
releases), because scrub/resilver did not repair the extra (dedupditto)
copy (see https://github.com/zfsonlinux/zfs/pull/8270).
In summary, this seldom-used feature doesn't work, and even if it did it
wouldn't provide useful data protection. It has a non-trivial
maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270).
We should remove the dedupditto functionality. For backwards
compatibility with the existing CLI, "zpool set dedupditto" will still
"succeed" (exit code zero), but won't have any effect. For backwards
compatibility with existing pools that had dedupditto enabled at some
point, the code will still be able to understand dedupditto blocks and
free them when appropriate. However, ZFS won't write any new dedupditto
blocks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Issue #8270
Closes #8310
2019-06-20 00:54:02 +03:00
|
|
|
(void) nvlist_remove_all(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_DEDUPDITTO));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (!error && reset_bootfs) {
|
|
|
|
error = nvlist_remove(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_BOOTFS), DATA_TYPE_STRING);
|
|
|
|
|
|
|
|
if (!error) {
|
|
|
|
error = nvlist_add_uint64(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_BOOTFS), objnum);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
void
|
|
|
|
spa_configfile_set(spa_t *spa, nvlist_t *nvp, boolean_t need_sync)
|
|
|
|
{
|
|
|
|
char *cachefile;
|
|
|
|
spa_config_dirent_t *dp;
|
|
|
|
|
|
|
|
if (nvlist_lookup_string(nvp, zpool_prop_to_name(ZPOOL_PROP_CACHEFILE),
|
|
|
|
&cachefile) != 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
dp = kmem_alloc(sizeof (spa_config_dirent_t),
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
if (cachefile[0] == '\0')
|
|
|
|
dp->scd_path = spa_strdup(spa_config_path);
|
|
|
|
else if (strcmp(cachefile, "none") == 0)
|
|
|
|
dp->scd_path = NULL;
|
|
|
|
else
|
|
|
|
dp->scd_path = spa_strdup(cachefile);
|
|
|
|
|
|
|
|
list_insert_head(&spa->spa_config_list, dp);
|
|
|
|
if (need_sync)
|
|
|
|
spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
|
|
|
spa_prop_set(spa_t *spa, nvlist_t *nvp)
|
|
|
|
{
|
|
|
|
int error;
|
2012-12-14 03:24:15 +04:00
|
|
|
nvpair_t *elem = NULL;
|
2009-02-18 23:51:31 +03:00
|
|
|
boolean_t need_sync = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if ((error = spa_prop_validate(spa, nvp)) != 0)
|
|
|
|
return (error);
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
while ((elem = nvlist_next_nvpair(nvp, elem)) != NULL) {
|
2012-12-14 03:24:15 +04:00
|
|
|
zpool_prop_t prop = zpool_name_to_prop(nvpair_name(elem));
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
if (prop == ZPOOL_PROP_CACHEFILE ||
|
|
|
|
prop == ZPOOL_PROP_ALTROOT ||
|
|
|
|
prop == ZPOOL_PROP_READONLY)
|
2009-02-18 23:51:31 +03:00
|
|
|
continue;
|
|
|
|
|
2018-01-19 20:22:37 +03:00
|
|
|
if (prop == ZPOOL_PROP_VERSION || prop == ZPOOL_PROP_INVAL) {
|
2021-11-30 17:46:25 +03:00
|
|
|
uint64_t ver = 0;
|
2012-12-14 03:24:15 +04:00
|
|
|
|
|
|
|
if (prop == ZPOOL_PROP_VERSION) {
|
|
|
|
VERIFY(nvpair_value_uint64(elem, &ver) == 0);
|
|
|
|
} else {
|
|
|
|
ASSERT(zpool_prop_feature(nvpair_name(elem)));
|
|
|
|
ver = SPA_VERSION_FEATURES;
|
|
|
|
need_sync = B_TRUE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Save time if the version is already set. */
|
|
|
|
if (ver == spa_version(spa))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In addition to the pool directory object, we might
|
|
|
|
* create the pool properties object, the features for
|
|
|
|
* read object, the features for write object, or the
|
|
|
|
* feature descriptions object.
|
|
|
|
*/
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_sync_task(spa->spa_name, NULL,
|
2014-11-03 23:28:43 +03:00
|
|
|
spa_sync_version, &ver,
|
|
|
|
6, ZFS_SPACE_CHECK_RESERVED);
|
2012-12-14 03:24:15 +04:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
need_sync = B_TRUE;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (need_sync) {
|
2013-09-04 16:00:57 +04:00
|
|
|
return (dsl_sync_task(spa->spa_name, NULL, spa_sync_props,
|
2014-11-03 23:28:43 +03:00
|
|
|
nvp, 6, ZFS_SPACE_CHECK_RESERVED));
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the bootfs property value is dsobj, clear it.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_prop_clear_bootfs(spa_t *spa, uint64_t dsobj, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
if (spa->spa_bootfs == dsobj && spa->spa_pool_props_object != 0) {
|
|
|
|
VERIFY(zap_remove(spa->spa_meta_objset,
|
|
|
|
spa->spa_pool_props_object,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_BOOTFS), tx) == 0);
|
|
|
|
spa->spa_bootfs = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-12-15 00:38:04 +04:00
|
|
|
static int
|
2013-09-04 16:00:57 +04:00
|
|
|
spa_change_guid_check(void *arg, dmu_tx_t *tx)
|
2012-12-15 00:38:04 +04:00
|
|
|
{
|
2019-12-05 23:37:00 +03:00
|
|
|
uint64_t *newguid __maybe_unused = arg;
|
2013-09-04 16:00:57 +04:00
|
|
|
spa_t *spa = dmu_tx_pool(tx)->dp_spa;
|
2012-12-15 00:38:04 +04:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
uint64_t vdev_state;
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
if (spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT)) {
|
|
|
|
int error = (spa_has_checkpoint(spa)) ?
|
|
|
|
ZFS_ERR_CHECKPOINT_EXISTS : ZFS_ERR_DISCARDING_CHECKPOINT;
|
|
|
|
return (SET_ERROR(error));
|
|
|
|
}
|
|
|
|
|
2012-12-15 00:38:04 +04:00
|
|
|
spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
|
|
|
|
vdev_state = rvd->vdev_state;
|
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
|
|
|
|
|
|
|
if (vdev_state != VDEV_STATE_HEALTHY)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENXIO));
|
2012-12-15 00:38:04 +04:00
|
|
|
|
|
|
|
ASSERT3U(spa_guid(spa), !=, *newguid);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2013-09-04 16:00:57 +04:00
|
|
|
spa_change_guid_sync(void *arg, dmu_tx_t *tx)
|
2012-12-15 00:38:04 +04:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
uint64_t *newguid = arg;
|
|
|
|
spa_t *spa = dmu_tx_pool(tx)->dp_spa;
|
2012-12-15 00:38:04 +04:00
|
|
|
uint64_t oldguid;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
|
|
|
oldguid = spa_guid(spa);
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
|
|
|
|
rvd->vdev_guid = *newguid;
|
|
|
|
rvd->vdev_guid_sum += (*newguid - oldguid);
|
|
|
|
vdev_config_dirty(rvd);
|
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal(spa, "guid change", tx, "old=%llu new=%llu",
|
2019-09-12 23:28:26 +03:00
|
|
|
(u_longlong_t)oldguid, (u_longlong_t)*newguid);
|
2012-12-15 00:38:04 +04:00
|
|
|
}
|
|
|
|
|
2011-11-12 02:07:54 +04:00
|
|
|
/*
|
|
|
|
* Change the GUID for the pool. This is done so that we can later
|
|
|
|
* re-import a pool built from a clone of our own vdevs. We will modify
|
|
|
|
* the root vdev's guid, our own pool guid, and then mark all of our
|
|
|
|
* vdevs dirty. Note that we must make sure that all our vdevs are
|
|
|
|
* online when we do this, or else any vdevs that weren't present
|
|
|
|
* would be orphaned from our pool. We are also going to issue a
|
|
|
|
* sysevent to update any watchers.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
spa_change_guid(spa_t *spa)
|
|
|
|
{
|
2012-12-15 00:38:04 +04:00
|
|
|
int error;
|
|
|
|
uint64_t guid;
|
2011-11-12 02:07:54 +04:00
|
|
|
|
2013-08-07 22:24:34 +04:00
|
|
|
mutex_enter(&spa->spa_vdev_top_lock);
|
2012-12-15 00:38:04 +04:00
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
guid = spa_generate_guid(NULL);
|
2011-11-12 02:07:54 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_sync_task(spa->spa_name, spa_change_guid_check,
|
2014-11-03 23:28:43 +03:00
|
|
|
spa_change_guid_sync, &guid, 5, ZFS_SPACE_CHECK_RESERVED);
|
2011-11-12 02:07:54 +04:00
|
|
|
|
2012-12-15 00:38:04 +04:00
|
|
|
if (error == 0) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_write_cachefile(spa, B_FALSE, B_TRUE);
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_REGUID);
|
2012-12-15 00:38:04 +04:00
|
|
|
}
|
2011-11-12 02:07:54 +04:00
|
|
|
|
2012-12-15 00:38:04 +04:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
2013-08-07 22:24:34 +04:00
|
|
|
mutex_exit(&spa->spa_vdev_top_lock);
|
2011-11-12 02:07:54 +04:00
|
|
|
|
2012-12-15 00:38:04 +04:00
|
|
|
return (error);
|
2011-11-12 02:07:54 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* SPA state manipulation (open/create/destroy/import/export)
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_error_entry_compare(const void *a, const void *b)
|
|
|
|
{
|
2016-08-27 21:12:53 +03:00
|
|
|
const spa_error_entry_t *sa = (const spa_error_entry_t *)a;
|
|
|
|
const spa_error_entry_t *sb = (const spa_error_entry_t *)b;
|
2008-11-20 23:01:55 +03:00
|
|
|
int ret;
|
|
|
|
|
2016-08-27 21:12:53 +03:00
|
|
|
ret = memcmp(&sa->se_bookmark, &sb->se_bookmark,
|
2014-06-25 22:37:59 +04:00
|
|
|
sizeof (zbookmark_phys_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 20:36:03 +03:00
|
|
|
return (TREE_ISIGN(ret));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Utility function which retrieves copies of the current logs and
|
|
|
|
* re-initializes them in the process.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_get_errlists(spa_t *spa, avl_tree_t *last, avl_tree_t *scrub)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&spa->spa_errlist_lock));
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(last, &spa->spa_errlist_last, sizeof (avl_tree_t));
|
|
|
|
memcpy(scrub, &spa->spa_errlist_scrub, sizeof (avl_tree_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
avl_create(&spa->spa_errlist_scrub,
|
|
|
|
spa_error_entry_compare, sizeof (spa_error_entry_t),
|
|
|
|
offsetof(spa_error_entry_t, se_avl));
|
|
|
|
avl_create(&spa->spa_errlist_last,
|
|
|
|
spa_error_entry_compare, sizeof (spa_error_entry_t),
|
|
|
|
offsetof(spa_error_entry_t, se_avl));
|
|
|
|
}
|
|
|
|
|
2013-05-06 23:24:30 +04:00
|
|
|
static void
|
|
|
|
spa_taskqs_init(spa_t *spa, zio_type_t t, zio_taskq_type_t q)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-05-06 23:24:30 +04:00
|
|
|
const zio_taskq_info_t *ztip = &zio_taskqs[t][q];
|
|
|
|
enum zti_modes mode = ztip->zti_mode;
|
|
|
|
uint_t value = ztip->zti_value;
|
|
|
|
uint_t count = ztip->zti_count;
|
|
|
|
spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
|
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 19:13:53 +03:00
|
|
|
uint_t cpus, flags = TASKQ_DYNAMIC;
|
2010-05-29 00:45:14 +04:00
|
|
|
boolean_t batch = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
switch (mode) {
|
|
|
|
case ZTI_MODE_FIXED:
|
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 19:13:53 +03:00
|
|
|
ASSERT3U(value, >, 0);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
break;
|
2013-05-06 23:24:30 +04:00
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
case ZTI_MODE_BATCH:
|
|
|
|
batch = B_TRUE;
|
|
|
|
flags |= TASKQ_THREADS_CPU_PCT;
|
2015-12-16 22:22:32 +03:00
|
|
|
value = MIN(zio_taskq_batch_pct, 100);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
break;
|
2013-05-06 23:24:30 +04:00
|
|
|
|
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 19:13:53 +03:00
|
|
|
case ZTI_MODE_SCALE:
|
|
|
|
flags |= TASKQ_THREADS_CPU_PCT;
|
|
|
|
/*
|
|
|
|
* We want more taskqs to reduce lock contention, but we want
|
|
|
|
* less for better request ordering and CPU utilization.
|
|
|
|
*/
|
|
|
|
cpus = MAX(1, boot_ncpus * zio_taskq_batch_pct / 100);
|
|
|
|
if (zio_taskq_batch_tpq > 0) {
|
|
|
|
count = MAX(1, (cpus + zio_taskq_batch_tpq / 2) /
|
|
|
|
zio_taskq_batch_tpq);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Prefer 6 threads per taskq, but no more taskqs
|
|
|
|
* than threads in them on large systems. For 80%:
|
|
|
|
*
|
|
|
|
* taskq taskq total
|
|
|
|
* cpus taskqs percent threads threads
|
|
|
|
* ------- ------- ------- ------- -------
|
|
|
|
* 1 1 80% 1 1
|
|
|
|
* 2 1 80% 1 1
|
|
|
|
* 4 1 80% 3 3
|
|
|
|
* 8 2 40% 3 6
|
|
|
|
* 16 3 27% 4 12
|
|
|
|
* 32 5 16% 5 25
|
|
|
|
* 64 7 11% 7 49
|
|
|
|
* 128 10 8% 10 100
|
|
|
|
* 256 14 6% 15 210
|
|
|
|
*/
|
|
|
|
count = 1 + cpus / 6;
|
|
|
|
while (count * count > cpus)
|
|
|
|
count--;
|
|
|
|
}
|
|
|
|
/* Limit each taskq within 100% to not trigger assertion. */
|
|
|
|
count = MAX(count, (zio_taskq_batch_pct + 99) / 100);
|
|
|
|
value = (zio_taskq_batch_pct + count / 2) / count;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ZTI_MODE_NULL:
|
|
|
|
tqs->stqs_count = 0;
|
|
|
|
tqs->stqs_taskq = NULL;
|
|
|
|
return;
|
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
default:
|
|
|
|
panic("unrecognized mode for %s_%s taskq (%u:%u) in "
|
|
|
|
"spa_activate()",
|
|
|
|
zio_type_name[t], zio_taskq_types[q], mode, value);
|
|
|
|
break;
|
|
|
|
}
|
2013-05-06 23:24:30 +04:00
|
|
|
|
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 19:13:53 +03:00
|
|
|
ASSERT3U(count, >, 0);
|
|
|
|
tqs->stqs_count = count;
|
|
|
|
tqs->stqs_taskq = kmem_alloc(count * sizeof (taskq_t *), KM_SLEEP);
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (uint_t i = 0; i < count; i++) {
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
taskq_t *tq;
|
Remove suffix from zio taskq names
For zio taskq's which have multiple instances (e.g. z_rd_int_0,
z_rd_int_1, etc), each one has a unique name (the _0, _1, _2 suffix).
This makes performance analysis more difficult, because by default,
`perf` includes the thread name (which is the same as the taskq name) in
the stack trace. This means that we get 8 different stacks, all of
which are doing the same thing, but are executed from different taskq's.
We should remove the suffix of the taskq name, so that all the
read-interrupt threads are named z_rd_int.
Note that we already support multiple taskq's with the same name. This
happens when there are multiple pools. In this case the taskq has a
different tq_instance, which shows up in /proc/spl/taskq-all.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7646
2018-06-21 00:07:50 +03:00
|
|
|
char name[32];
|
2013-05-06 23:24:30 +04:00
|
|
|
|
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 19:13:53 +03:00
|
|
|
if (count > 1)
|
|
|
|
(void) snprintf(name, sizeof (name), "%s_%s_%u",
|
|
|
|
zio_type_name[t], zio_taskq_types[q], i);
|
|
|
|
else
|
|
|
|
(void) snprintf(name, sizeof (name), "%s_%s",
|
|
|
|
zio_type_name[t], zio_taskq_types[q]);
|
2013-05-06 23:24:30 +04:00
|
|
|
|
|
|
|
if (zio_taskq_sysdc && spa->spa_proc != &p0) {
|
|
|
|
if (batch)
|
|
|
|
flags |= TASKQ_DC_BATCH;
|
|
|
|
|
2022-01-15 02:37:55 +03:00
|
|
|
(void) zio_taskq_basedc;
|
2013-05-06 23:24:30 +04:00
|
|
|
tq = taskq_create_sysdc(name, value, 50, INT_MAX,
|
|
|
|
spa->spa_proc, zio_taskq_basedc, flags);
|
|
|
|
} else {
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
pri_t pri = maxclsyspri;
|
|
|
|
/*
|
|
|
|
* The write issue taskq can be extremely CPU
|
2015-07-24 20:08:31 +03:00
|
|
|
* intensive. Run it at slightly less important
|
2020-09-04 21:13:27 +03:00
|
|
|
* priority than the other taskqs.
|
|
|
|
*
|
|
|
|
* Under Linux and FreeBSD this means incrementing
|
|
|
|
* the priority value as opposed to platforms like
|
|
|
|
* illumos where it should be decremented.
|
|
|
|
*
|
|
|
|
* On FreeBSD, if priorities divided by four (RQ_PPQ)
|
|
|
|
* are equal then a difference between them is
|
|
|
|
* insignificant.
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
*/
|
2020-09-04 21:13:27 +03:00
|
|
|
if (t == ZIO_TYPE_WRITE && q == ZIO_TASKQ_ISSUE) {
|
|
|
|
#if defined(__linux__)
|
2015-07-24 20:08:31 +03:00
|
|
|
pri++;
|
2020-09-04 21:13:27 +03:00
|
|
|
#elif defined(__FreeBSD__)
|
|
|
|
pri += 4;
|
|
|
|
#else
|
|
|
|
#error "unknown OS"
|
|
|
|
#endif
|
|
|
|
}
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
tq = taskq_create_proc(name, value, pri, 50,
|
2013-05-06 23:24:30 +04:00
|
|
|
INT_MAX, spa->spa_proc, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
tqs->stqs_taskq[i] = tq;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_taskqs_fini(spa_t *spa, zio_type_t t, zio_taskq_type_t q)
|
|
|
|
{
|
|
|
|
spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
|
|
|
|
|
|
|
|
if (tqs->stqs_taskq == NULL) {
|
|
|
|
ASSERT3U(tqs->stqs_count, ==, 0);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (uint_t i = 0; i < tqs->stqs_count; i++) {
|
2013-05-06 23:24:30 +04:00
|
|
|
ASSERT3P(tqs->stqs_taskq[i], !=, NULL);
|
|
|
|
taskq_destroy(tqs->stqs_taskq[i]);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-05-06 23:24:30 +04:00
|
|
|
kmem_free(tqs->stqs_taskq, tqs->stqs_count * sizeof (taskq_t *));
|
|
|
|
tqs->stqs_taskq = NULL;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-05-06 23:24:30 +04:00
|
|
|
/*
|
|
|
|
* Dispatch a task to the appropriate taskq for the ZFS I/O type and priority.
|
|
|
|
* Note that a type may have multiple discrete taskqs to avoid lock contention
|
|
|
|
* on the taskq itself. In that case we choose which taskq at random by using
|
|
|
|
* the low bits of gethrtime().
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_taskq_dispatch_ent(spa_t *spa, zio_type_t t, zio_taskq_type_t q,
|
|
|
|
task_func_t *func, void *arg, uint_t flags, taskq_ent_t *ent)
|
|
|
|
{
|
|
|
|
spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
|
|
|
|
taskq_t *tq;
|
|
|
|
|
|
|
|
ASSERT3P(tqs->stqs_taskq, !=, NULL);
|
|
|
|
ASSERT3U(tqs->stqs_count, !=, 0);
|
|
|
|
|
|
|
|
if (tqs->stqs_count == 1) {
|
|
|
|
tq = tqs->stqs_taskq[0];
|
|
|
|
} else {
|
2013-06-27 15:41:30 +04:00
|
|
|
tq = tqs->stqs_taskq[((uint64_t)gethrtime()) % tqs->stqs_count];
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2013-05-06 23:24:30 +04:00
|
|
|
|
|
|
|
taskq_dispatch_ent(tq, func, arg, flags, ent);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2013-05-04 01:17:21 +04:00
|
|
|
/*
|
|
|
|
* Same as spa_taskq_dispatch_ent() but block on the task until completion.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_taskq_dispatch_sync(spa_t *spa, zio_type_t t, zio_taskq_type_t q,
|
|
|
|
task_func_t *func, void *arg, uint_t flags)
|
|
|
|
{
|
|
|
|
spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
|
|
|
|
taskq_t *tq;
|
|
|
|
taskqid_t id;
|
|
|
|
|
|
|
|
ASSERT3P(tqs->stqs_taskq, !=, NULL);
|
|
|
|
ASSERT3U(tqs->stqs_count, !=, 0);
|
|
|
|
|
|
|
|
if (tqs->stqs_count == 1) {
|
|
|
|
tq = tqs->stqs_taskq[0];
|
|
|
|
} else {
|
2013-06-27 15:41:30 +04:00
|
|
|
tq = tqs->stqs_taskq[((uint64_t)gethrtime()) % tqs->stqs_count];
|
2013-05-04 01:17:21 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
id = taskq_dispatch(tq, func, arg, flags);
|
|
|
|
if (id)
|
|
|
|
taskq_wait_id(tq, id);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
spa_create_zio_taskqs(spa_t *spa)
|
|
|
|
{
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int t = 0; t < ZIO_TYPES; t++) {
|
|
|
|
for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
|
2013-05-06 23:24:30 +04:00
|
|
|
spa_taskqs_init(spa, t, q);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
/*
|
|
|
|
* Disabled until spa_thread() can be adapted for Linux.
|
|
|
|
*/
|
|
|
|
#undef HAVE_SPA_THREAD
|
|
|
|
|
2010-08-26 22:52:20 +04:00
|
|
|
#if defined(_KERNEL) && defined(HAVE_SPA_THREAD)
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
spa_thread(void *arg)
|
|
|
|
{
|
2018-02-16 04:53:18 +03:00
|
|
|
psetid_t zio_taskq_psrset_bind = PS_NONE;
|
2010-05-29 00:45:14 +04:00
|
|
|
callb_cpr_t cprinfo;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_t *spa = arg;
|
|
|
|
user_t *pu = PTOU(curproc);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
CALLB_CPR_INIT(&cprinfo, &spa->spa_proc_lock, callb_generic_cpr,
|
|
|
|
spa->spa_name);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(curproc != &p0);
|
|
|
|
(void) snprintf(pu->u_psargs, sizeof (pu->u_psargs),
|
|
|
|
"zpool-%s", spa->spa_name);
|
|
|
|
(void) strlcpy(pu->u_comm, pu->u_psargs, sizeof (pu->u_comm));
|
|
|
|
|
|
|
|
/* bind this thread to the requested psrset */
|
|
|
|
if (zio_taskq_psrset_bind != PS_NONE) {
|
|
|
|
pool_lock();
|
|
|
|
mutex_enter(&cpu_lock);
|
|
|
|
mutex_enter(&pidlock);
|
|
|
|
mutex_enter(&curproc->p_lock);
|
|
|
|
|
|
|
|
if (cpupart_bind_thread(curthread, zio_taskq_psrset_bind,
|
|
|
|
0, NULL, NULL) == 0) {
|
|
|
|
curthread->t_bind_pset = zio_taskq_psrset_bind;
|
|
|
|
} else {
|
|
|
|
cmn_err(CE_WARN,
|
|
|
|
"Couldn't bind process for zfs pool \"%s\" to "
|
|
|
|
"pset %d\n", spa->spa_name, zio_taskq_psrset_bind);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&curproc->p_lock);
|
|
|
|
mutex_exit(&pidlock);
|
|
|
|
mutex_exit(&cpu_lock);
|
|
|
|
pool_unlock();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (zio_taskq_sysdc) {
|
|
|
|
sysdc_thread_enter(curthread, 100, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
spa->spa_proc = curproc;
|
|
|
|
spa->spa_did = curthread->t_did;
|
|
|
|
|
|
|
|
spa_create_zio_taskqs(spa);
|
|
|
|
|
|
|
|
mutex_enter(&spa->spa_proc_lock);
|
|
|
|
ASSERT(spa->spa_proc_state == SPA_PROC_CREATED);
|
|
|
|
|
|
|
|
spa->spa_proc_state = SPA_PROC_ACTIVE;
|
|
|
|
cv_broadcast(&spa->spa_proc_cv);
|
|
|
|
|
|
|
|
CALLB_CPR_SAFE_BEGIN(&cprinfo);
|
|
|
|
while (spa->spa_proc_state == SPA_PROC_ACTIVE)
|
|
|
|
cv_wait(&spa->spa_proc_cv, &spa->spa_proc_lock);
|
|
|
|
CALLB_CPR_SAFE_END(&cprinfo, &spa->spa_proc_lock);
|
|
|
|
|
|
|
|
ASSERT(spa->spa_proc_state == SPA_PROC_DEACTIVATE);
|
|
|
|
spa->spa_proc_state = SPA_PROC_GONE;
|
|
|
|
spa->spa_proc = &p0;
|
|
|
|
cv_broadcast(&spa->spa_proc_cv);
|
|
|
|
CALLB_CPR_EXIT(&cprinfo); /* drops spa_proc_lock */
|
|
|
|
|
|
|
|
mutex_enter(&curproc->p_lock);
|
|
|
|
lwp_exit();
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Activate an uninitialized pool.
|
|
|
|
*/
|
|
|
|
static void
|
2019-11-21 20:32:57 +03:00
|
|
|
spa_activate(spa_t *spa, spa_mode_t mode)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
ASSERT(spa->spa_state == POOL_STATE_UNINITIALIZED);
|
|
|
|
|
|
|
|
spa->spa_state = POOL_STATE_ACTIVE;
|
|
|
|
spa->spa_mode = mode;
|
2021-11-09 23:50:39 +03:00
|
|
|
spa->spa_read_spacemaps = spa_mode_readable_spacemaps;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2022-01-15 02:37:55 +03:00
|
|
|
spa->spa_normal_class = metaslab_class_create(spa, &zfs_metaslab_ops);
|
|
|
|
spa->spa_log_class = metaslab_class_create(spa, &zfs_metaslab_ops);
|
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-22 02:12:54 +03:00
|
|
|
spa->spa_embedded_log_class =
|
2022-01-15 02:37:55 +03:00
|
|
|
metaslab_class_create(spa, &zfs_metaslab_ops);
|
|
|
|
spa->spa_special_class = metaslab_class_create(spa, &zfs_metaslab_ops);
|
|
|
|
spa->spa_dedup_class = metaslab_class_create(spa, &zfs_metaslab_ops);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/* Try to create a covering process */
|
|
|
|
mutex_enter(&spa->spa_proc_lock);
|
|
|
|
ASSERT(spa->spa_proc_state == SPA_PROC_NONE);
|
|
|
|
ASSERT(spa->spa_proc == &p0);
|
|
|
|
spa->spa_did = 0;
|
|
|
|
|
2022-01-15 02:37:55 +03:00
|
|
|
(void) spa_create_process;
|
2010-08-26 22:52:20 +04:00
|
|
|
#ifdef HAVE_SPA_THREAD
|
2010-05-29 00:45:14 +04:00
|
|
|
/* Only create a process if we're going to be around a while. */
|
|
|
|
if (spa_create_process && strcmp(spa->spa_name, TRYIMPORT_NAME) != 0) {
|
|
|
|
if (newproc(spa_thread, (caddr_t)spa, syscid, maxclsyspri,
|
|
|
|
NULL, 0) == 0) {
|
|
|
|
spa->spa_proc_state = SPA_PROC_CREATED;
|
|
|
|
while (spa->spa_proc_state == SPA_PROC_CREATED) {
|
|
|
|
cv_wait(&spa->spa_proc_cv,
|
|
|
|
&spa->spa_proc_lock);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(spa->spa_proc_state == SPA_PROC_ACTIVE);
|
|
|
|
ASSERT(spa->spa_proc != &p0);
|
|
|
|
ASSERT(spa->spa_did != 0);
|
|
|
|
} else {
|
|
|
|
#ifdef _KERNEL
|
|
|
|
cmn_err(CE_WARN,
|
|
|
|
"Couldn't create process for zfs pool \"%s\"\n",
|
|
|
|
spa->spa_name);
|
|
|
|
#endif
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-08-26 22:52:20 +04:00
|
|
|
#endif /* HAVE_SPA_THREAD */
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&spa->spa_proc_lock);
|
|
|
|
|
|
|
|
/* If we didn't create a process, we need to create our taskqs. */
|
|
|
|
if (spa->spa_proc == &p0) {
|
|
|
|
spa_create_zio_taskqs(spa);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
for (size_t i = 0; i < TXG_SIZE; i++) {
|
|
|
|
spa->spa_txg_zio[i] = zio_root(spa, NULL, NULL,
|
|
|
|
ZIO_FLAG_CANFAIL);
|
|
|
|
}
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
list_create(&spa->spa_config_dirty_list, sizeof (vdev_t),
|
|
|
|
offsetof(vdev_t, vdev_config_dirty_node));
|
2015-04-02 06:44:32 +03:00
|
|
|
list_create(&spa->spa_evicting_os_list, sizeof (objset_t),
|
|
|
|
offsetof(objset_t, os_evicting_node));
|
2008-12-03 23:09:06 +03:00
|
|
|
list_create(&spa->spa_state_dirty_list, sizeof (vdev_t),
|
|
|
|
offsetof(vdev_t, vdev_state_dirty_node));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-04-24 19:34:36 +03:00
|
|
|
txg_list_create(&spa->spa_vdev_txg_list, spa,
|
2008-11-20 23:01:55 +03:00
|
|
|
offsetof(struct vdev, vdev_txg_node));
|
|
|
|
|
|
|
|
avl_create(&spa->spa_errlist_scrub,
|
|
|
|
spa_error_entry_compare, sizeof (spa_error_entry_t),
|
|
|
|
offsetof(spa_error_entry_t, se_avl));
|
|
|
|
avl_create(&spa->spa_errlist_last,
|
|
|
|
spa_error_entry_compare, sizeof (spa_error_entry_t),
|
|
|
|
offsetof(spa_error_entry_t, se_avl));
|
2022-07-29 01:52:46 +03:00
|
|
|
avl_create(&spa->spa_errlist_healed,
|
|
|
|
spa_error_entry_compare, sizeof (spa_error_entry_t),
|
|
|
|
offsetof(spa_error_entry_t, se_avl));
|
2014-03-22 13:07:14 +04:00
|
|
|
|
2022-02-16 02:54:25 +03:00
|
|
|
spa_activate_os(spa);
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
spa_keystore_init(&spa->spa_keystore);
|
|
|
|
|
2014-03-22 13:07:14 +04:00
|
|
|
/*
|
|
|
|
* This taskq is used to perform zvol-minor-related tasks
|
|
|
|
* asynchronously. This has several advantages, including easy
|
2020-10-09 06:10:13 +03:00
|
|
|
* resolution of various deadlocks.
|
2014-03-22 13:07:14 +04:00
|
|
|
*
|
|
|
|
* The taskq must be single threaded to ensure tasks are always
|
|
|
|
* processed in the order in which they were dispatched.
|
|
|
|
*
|
|
|
|
* A taskq per pool allows one to keep the pools independent.
|
|
|
|
* This way if one pool is suspended, it will not impact another.
|
|
|
|
*
|
|
|
|
* The preferred location to dispatch a zvol minor task is a sync
|
|
|
|
* task. In this context, there is easy access to the spa_t and minimal
|
|
|
|
* error handling is required because the sync task must succeed.
|
|
|
|
*/
|
|
|
|
spa->spa_zvol_taskq = taskq_create("z_zvol", 1, defclsyspri,
|
|
|
|
1, INT_MAX, 0);
|
2016-10-04 21:46:10 +03:00
|
|
|
|
2018-03-30 22:10:01 +03:00
|
|
|
/*
|
|
|
|
* Taskq dedicated to prefetcher threads: this is used to prevent the
|
|
|
|
* pool traverse code from monopolizing the global (and limited)
|
|
|
|
* system_taskq by inappropriately scheduling long running tasks on it.
|
|
|
|
*/
|
2020-12-11 01:09:23 +03:00
|
|
|
spa->spa_prefetch_taskq = taskq_create("z_prefetch", 100,
|
|
|
|
defclsyspri, 1, INT_MAX, TASKQ_DYNAMIC | TASKQ_THREADS_CPU_PCT);
|
2018-03-30 22:10:01 +03:00
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
/*
|
|
|
|
* The taskq to upgrade datasets in this pool. Currently used by
|
2018-02-14 01:54:54 +03:00
|
|
|
* feature SPA_FEATURE_USEROBJ_ACCOUNTING/SPA_FEATURE_PROJECT_QUOTA.
|
2016-10-04 21:46:10 +03:00
|
|
|
*/
|
2020-12-11 01:09:23 +03:00
|
|
|
spa->spa_upgrade_taskq = taskq_create("z_upgrade", 100,
|
|
|
|
defclsyspri, 1, INT_MAX, TASKQ_DYNAMIC | TASKQ_THREADS_CPU_PCT);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Opposite of spa_activate().
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_deactivate(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT(spa->spa_sync_on == B_FALSE);
|
|
|
|
ASSERT(spa->spa_dsl_pool == NULL);
|
|
|
|
ASSERT(spa->spa_root_vdev == NULL);
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT(spa->spa_async_zio_root == NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(spa->spa_state != POOL_STATE_UNINITIALIZED);
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
spa_evicting_os_wait(spa);
|
|
|
|
|
2014-03-22 13:07:14 +04:00
|
|
|
if (spa->spa_zvol_taskq) {
|
|
|
|
taskq_destroy(spa->spa_zvol_taskq);
|
|
|
|
spa->spa_zvol_taskq = NULL;
|
|
|
|
}
|
|
|
|
|
2018-03-30 22:10:01 +03:00
|
|
|
if (spa->spa_prefetch_taskq) {
|
|
|
|
taskq_destroy(spa->spa_prefetch_taskq);
|
|
|
|
spa->spa_prefetch_taskq = NULL;
|
|
|
|
}
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
if (spa->spa_upgrade_taskq) {
|
|
|
|
taskq_destroy(spa->spa_upgrade_taskq);
|
|
|
|
spa->spa_upgrade_taskq = NULL;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
txg_list_destroy(&spa->spa_vdev_txg_list);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
list_destroy(&spa->spa_config_dirty_list);
|
2015-04-02 06:44:32 +03:00
|
|
|
list_destroy(&spa->spa_evicting_os_list);
|
2008-12-03 23:09:06 +03:00
|
|
|
list_destroy(&spa->spa_state_dirty_list);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-12-01 00:56:50 +03:00
|
|
|
taskq_cancel_id(system_delay_taskq, spa->spa_deadman_tqid);
|
2013-04-30 02:49:23 +04:00
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int t = 0; t < ZIO_TYPES; t++) {
|
|
|
|
for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
|
2013-05-06 23:24:30 +04:00
|
|
|
spa_taskqs_fini(spa, t, q);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
for (size_t i = 0; i < TXG_SIZE; i++) {
|
|
|
|
ASSERT3P(spa->spa_txg_zio[i], !=, NULL);
|
|
|
|
VERIFY0(zio_wait(spa->spa_txg_zio[i]));
|
|
|
|
spa->spa_txg_zio[i] = NULL;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
metaslab_class_destroy(spa->spa_normal_class);
|
|
|
|
spa->spa_normal_class = NULL;
|
|
|
|
|
|
|
|
metaslab_class_destroy(spa->spa_log_class);
|
|
|
|
spa->spa_log_class = NULL;
|
|
|
|
|
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-22 02:12:54 +03:00
|
|
|
metaslab_class_destroy(spa->spa_embedded_log_class);
|
|
|
|
spa->spa_embedded_log_class = NULL;
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
metaslab_class_destroy(spa->spa_special_class);
|
|
|
|
spa->spa_special_class = NULL;
|
|
|
|
|
|
|
|
metaslab_class_destroy(spa->spa_dedup_class);
|
|
|
|
spa->spa_dedup_class = NULL;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* If this was part of an import or the open otherwise failed, we may
|
|
|
|
* still have errors left in the queues. Empty them just in case.
|
|
|
|
*/
|
|
|
|
spa_errlog_drain(spa);
|
|
|
|
avl_destroy(&spa->spa_errlist_scrub);
|
|
|
|
avl_destroy(&spa->spa_errlist_last);
|
2022-07-29 01:52:46 +03:00
|
|
|
avl_destroy(&spa->spa_errlist_healed);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
spa_keystore_fini(&spa->spa_keystore);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_state = POOL_STATE_UNINITIALIZED;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
mutex_enter(&spa->spa_proc_lock);
|
|
|
|
if (spa->spa_proc_state != SPA_PROC_NONE) {
|
|
|
|
ASSERT(spa->spa_proc_state == SPA_PROC_ACTIVE);
|
|
|
|
spa->spa_proc_state = SPA_PROC_DEACTIVATE;
|
|
|
|
cv_broadcast(&spa->spa_proc_cv);
|
|
|
|
while (spa->spa_proc_state == SPA_PROC_DEACTIVATE) {
|
|
|
|
ASSERT(spa->spa_proc != &p0);
|
|
|
|
cv_wait(&spa->spa_proc_cv, &spa->spa_proc_lock);
|
|
|
|
}
|
|
|
|
ASSERT(spa->spa_proc_state == SPA_PROC_GONE);
|
|
|
|
spa->spa_proc_state = SPA_PROC_NONE;
|
|
|
|
}
|
|
|
|
ASSERT(spa->spa_proc == &p0);
|
|
|
|
mutex_exit(&spa->spa_proc_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We want to make sure spa_thread() has actually exited the ZFS
|
|
|
|
* module, so that the module can't be unloaded out from underneath
|
|
|
|
* it.
|
|
|
|
*/
|
|
|
|
if (spa->spa_did != 0) {
|
|
|
|
thread_join(spa->spa_did);
|
|
|
|
spa->spa_did = 0;
|
|
|
|
}
|
2022-02-16 02:54:25 +03:00
|
|
|
|
|
|
|
spa_deactivate_os(spa);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify a pool configuration, and construct the vdev tree appropriately. This
|
|
|
|
* will create all the necessary vdevs in the appropriate layout, with each vdev
|
|
|
|
* in the CLOSED state. This will prep the pool before open/creation/import.
|
|
|
|
* All vdev validation is done by the vdev_alloc() routine.
|
|
|
|
*/
|
2019-10-28 19:51:53 +03:00
|
|
|
int
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_config_parse(spa_t *spa, vdev_t **vdp, nvlist_t *nv, vdev_t *parent,
|
|
|
|
uint_t id, int atype)
|
|
|
|
{
|
|
|
|
nvlist_t **child;
|
2009-07-03 02:44:48 +04:00
|
|
|
uint_t children;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
|
|
|
|
|
|
|
if ((error = vdev_alloc(spa, vdp, nv, parent, id, atype)) != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
if ((*vdp)->vdev_ops->vdev_op_leaf)
|
|
|
|
return (0);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
error = nvlist_lookup_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
|
|
|
|
&child, &children);
|
|
|
|
|
|
|
|
if (error == ENOENT)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (error) {
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_free(*vdp);
|
|
|
|
*vdp = NULL;
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINVAL));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < children; c++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_t *vd;
|
|
|
|
if ((error = spa_config_parse(spa, &vd, child[c], *vdp, c,
|
|
|
|
atype)) != 0) {
|
|
|
|
vdev_free(*vdp);
|
|
|
|
*vdp = NULL;
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(*vdp != NULL);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
static boolean_t
|
|
|
|
spa_should_flush_logs_on_unload(spa_t *spa)
|
|
|
|
{
|
|
|
|
if (!spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP))
|
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
if (!spa_writeable(spa))
|
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
if (!spa->spa_sync_on)
|
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
if (spa_state(spa) != POOL_STATE_EXPORTED)
|
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
if (zfs_keep_log_spacemaps_at_export)
|
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Opens a transaction that will set the flag that will instruct
|
|
|
|
* spa_sync to attempt to flush all the metaslabs for that txg.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_unload_log_sm_flush_all(spa_t *spa)
|
|
|
|
{
|
|
|
|
dmu_tx_t *tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
|
|
|
|
VERIFY0(dmu_tx_assign(tx, TXG_WAIT));
|
|
|
|
|
|
|
|
ASSERT3U(spa->spa_log_flushall_txg, ==, 0);
|
|
|
|
spa->spa_log_flushall_txg = dmu_tx_get_txg(tx);
|
|
|
|
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), spa->spa_log_flushall_txg);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_unload_log_sm_metadata(spa_t *spa)
|
|
|
|
{
|
|
|
|
void *cookie = NULL;
|
|
|
|
spa_log_sm_t *sls;
|
|
|
|
while ((sls = avl_destroy_nodes(&spa->spa_sm_logs_by_txg,
|
|
|
|
&cookie)) != NULL) {
|
|
|
|
VERIFY0(sls->sls_mscount);
|
|
|
|
kmem_free(sls, sizeof (spa_log_sm_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
|
|
|
|
e != NULL; e = list_head(&spa->spa_log_summary)) {
|
|
|
|
VERIFY0(e->lse_mscount);
|
|
|
|
list_remove(&spa->spa_log_summary, e);
|
|
|
|
kmem_free(e, sizeof (log_summary_entry_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
spa->spa_unflushed_stats.sus_nblocks = 0;
|
|
|
|
spa->spa_unflushed_stats.sus_memused = 0;
|
|
|
|
spa->spa_unflushed_stats.sus_blocklimit = 0;
|
|
|
|
}
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
static void
|
|
|
|
spa_destroy_aux_threads(spa_t *spa)
|
|
|
|
{
|
|
|
|
if (spa->spa_condense_zthr != NULL) {
|
|
|
|
zthr_destroy(spa->spa_condense_zthr);
|
|
|
|
spa->spa_condense_zthr = NULL;
|
|
|
|
}
|
|
|
|
if (spa->spa_checkpoint_discard_zthr != NULL) {
|
|
|
|
zthr_destroy(spa->spa_checkpoint_discard_zthr);
|
|
|
|
spa->spa_checkpoint_discard_zthr = NULL;
|
|
|
|
}
|
|
|
|
if (spa->spa_livelist_delete_zthr != NULL) {
|
|
|
|
zthr_destroy(spa->spa_livelist_delete_zthr);
|
|
|
|
spa->spa_livelist_delete_zthr = NULL;
|
|
|
|
}
|
|
|
|
if (spa->spa_livelist_condense_zthr != NULL) {
|
|
|
|
zthr_destroy(spa->spa_livelist_condense_zthr);
|
|
|
|
spa->spa_livelist_condense_zthr = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Opposite of spa_load().
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_unload(spa_t *spa)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
ASSERT(spa_state(spa) != POOL_STATE_UNINITIALIZED);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
spa_import_progress_remove(spa_guid(spa));
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_note(spa, "UNLOADING");
|
|
|
|
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
spa_wake_waiters(spa);
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
/*
|
2022-02-16 02:48:59 +03:00
|
|
|
* If we have set the spa_final_txg, we have already performed the
|
|
|
|
* tasks below in spa_export_common(). We should not redo it here since
|
|
|
|
* we delay the final TXGs beyond what spa_final_txg is set at.
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
*/
|
2022-02-16 02:48:59 +03:00
|
|
|
if (spa->spa_final_txg == UINT64_MAX) {
|
|
|
|
/*
|
|
|
|
* If the log space map feature is enabled and the pool is
|
|
|
|
* getting exported (but not destroyed), we want to spend some
|
|
|
|
* time flushing as many metaslabs as we can in an attempt to
|
|
|
|
* destroy log space maps and save import time.
|
|
|
|
*/
|
|
|
|
if (spa_should_flush_logs_on_unload(spa))
|
|
|
|
spa_unload_log_sm_flush_all(spa);
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
|
2022-02-16 02:48:59 +03:00
|
|
|
/*
|
|
|
|
* Stop async tasks.
|
|
|
|
*/
|
|
|
|
spa_async_suspend(spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-02-16 02:48:59 +03:00
|
|
|
if (spa->spa_root_vdev) {
|
|
|
|
vdev_t *root_vdev = spa->spa_root_vdev;
|
|
|
|
vdev_initialize_stop_all(root_vdev,
|
|
|
|
VDEV_INITIALIZE_ACTIVE);
|
|
|
|
vdev_trim_stop_all(root_vdev, VDEV_TRIM_ACTIVE);
|
|
|
|
vdev_autotrim_stop_all(spa);
|
|
|
|
vdev_rebuild_stop_all(spa);
|
|
|
|
}
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Stop syncing.
|
|
|
|
*/
|
|
|
|
if (spa->spa_sync_on) {
|
|
|
|
txg_sync_stop(spa->spa_dsl_pool);
|
|
|
|
spa->spa_sync_on = B_FALSE;
|
|
|
|
}
|
|
|
|
|
2017-01-12 22:52:56 +03:00
|
|
|
/*
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
* This ensures that there is no async metaslab prefetching
|
|
|
|
* while we attempt to unload the spa.
|
2017-01-12 22:52:56 +03:00
|
|
|
*/
|
|
|
|
if (spa->spa_root_vdev != NULL) {
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
for (int c = 0; c < spa->spa_root_vdev->vdev_children; c++) {
|
|
|
|
vdev_t *vc = spa->spa_root_vdev->vdev_child[c];
|
|
|
|
if (vc->vdev_mg != NULL)
|
|
|
|
taskq_wait(vc->vdev_mg->mg_taskq);
|
|
|
|
}
|
2017-01-12 22:52:56 +03:00
|
|
|
}
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (spa->spa_mmp.mmp_thread)
|
|
|
|
mmp_thread_stop(spa);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Wait for any outstanding async I/O to complete.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-07-03 02:44:48 +04:00
|
|
|
if (spa->spa_async_zio_root != NULL) {
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int i = 0; i < max_ncpus; i++)
|
2014-09-17 10:59:43 +04:00
|
|
|
(void) zio_wait(spa->spa_async_zio_root[i]);
|
|
|
|
kmem_free(spa->spa_async_zio_root, max_ncpus * sizeof (void *));
|
2009-07-03 02:44:48 +04:00
|
|
|
spa->spa_async_zio_root = NULL;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (spa->spa_vdev_removal != NULL) {
|
|
|
|
spa_vdev_removal_destroy(spa->spa_vdev_removal);
|
|
|
|
spa->spa_vdev_removal = NULL;
|
|
|
|
}
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
spa_destroy_aux_threads(spa);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_condense_fini(spa);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
bpobj_close(&spa->spa_deferred_bpobj);
|
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, spa, RW_WRITER);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Close all vdevs.
|
|
|
|
*/
|
|
|
|
if (spa->spa_root_vdev)
|
|
|
|
vdev_free(spa->spa_root_vdev);
|
|
|
|
ASSERT(spa->spa_root_vdev == NULL);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Close the dsl pool.
|
|
|
|
*/
|
|
|
|
if (spa->spa_dsl_pool) {
|
|
|
|
dsl_pool_close(spa->spa_dsl_pool);
|
|
|
|
spa->spa_dsl_pool = NULL;
|
2010-05-29 00:45:14 +04:00
|
|
|
spa->spa_meta_objset = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ddt_unload(spa);
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
spa_unload_log_sm_metadata(spa);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
/*
|
|
|
|
* Drop and purge level 2 cache
|
|
|
|
*/
|
|
|
|
spa_l2cache_drop(spa);
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
for (int i = 0; i < spa->spa_spares.sav_count; i++)
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_free(spa->spa_spares.sav_vdevs[i]);
|
|
|
|
if (spa->spa_spares.sav_vdevs) {
|
|
|
|
kmem_free(spa->spa_spares.sav_vdevs,
|
|
|
|
spa->spa_spares.sav_count * sizeof (void *));
|
|
|
|
spa->spa_spares.sav_vdevs = NULL;
|
|
|
|
}
|
|
|
|
if (spa->spa_spares.sav_config) {
|
|
|
|
nvlist_free(spa->spa_spares.sav_config);
|
|
|
|
spa->spa_spares.sav_config = NULL;
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
spa->spa_spares.sav_count = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {
|
2012-04-08 21:23:08 +04:00
|
|
|
vdev_clear_stats(spa->spa_l2cache.sav_vdevs[i]);
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_free(spa->spa_l2cache.sav_vdevs[i]);
|
2012-04-08 21:23:08 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
if (spa->spa_l2cache.sav_vdevs) {
|
|
|
|
kmem_free(spa->spa_l2cache.sav_vdevs,
|
|
|
|
spa->spa_l2cache.sav_count * sizeof (void *));
|
|
|
|
spa->spa_l2cache.sav_vdevs = NULL;
|
|
|
|
}
|
|
|
|
if (spa->spa_l2cache.sav_config) {
|
|
|
|
nvlist_free(spa->spa_l2cache.sav_config);
|
|
|
|
spa->spa_l2cache.sav_config = NULL;
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
spa->spa_l2cache.sav_count = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
spa->spa_async_suspended = 0;
|
2009-01-16 00:59:39 +03:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa->spa_indirect_vdevs_loaded = B_FALSE;
|
|
|
|
|
2011-11-15 23:01:27 +04:00
|
|
|
if (spa->spa_comment != NULL) {
|
|
|
|
spa_strfree(spa->spa_comment);
|
|
|
|
spa->spa_comment = NULL;
|
|
|
|
}
|
2021-02-18 08:30:45 +03:00
|
|
|
if (spa->spa_compatibility != NULL) {
|
|
|
|
spa_strfree(spa->spa_compatibility);
|
|
|
|
spa->spa_compatibility = NULL;
|
|
|
|
}
|
2011-11-15 23:01:27 +04:00
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Load (or re-load) the current list of vdevs describing the active spares for
|
|
|
|
* this pool. When this is called, we have some form of basic information in
|
|
|
|
* 'spa_spares.sav_config'. We parse this into vdevs, try to open them, and
|
|
|
|
* then re-generate a more complete list including status information.
|
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
void
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_load_spares(spa_t *spa)
|
|
|
|
{
|
|
|
|
nvlist_t **spares;
|
|
|
|
uint_t nspares;
|
|
|
|
int i;
|
|
|
|
vdev_t *vd, *tvd;
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
#ifndef _KERNEL
|
|
|
|
/*
|
|
|
|
* zdb opens both the current state of the pool and the
|
|
|
|
* checkpointed state (if present), with a different spa_t.
|
|
|
|
*
|
|
|
|
* As spare vdevs are shared among open pools, we skip loading
|
|
|
|
* them when we load the checkpointed state of the pool.
|
|
|
|
*/
|
|
|
|
if (!spa_writeable(spa))
|
|
|
|
return;
|
|
|
|
#endif
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* First, close and free any existing spare vdevs.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < spa->spa_spares.sav_count; i++) {
|
|
|
|
vd = spa->spa_spares.sav_vdevs[i];
|
|
|
|
|
|
|
|
/* Undo the call to spa_activate() below */
|
2008-12-03 23:09:06 +03:00
|
|
|
if ((tvd = spa_lookup_by_guid(spa, vd->vdev_guid,
|
|
|
|
B_FALSE)) != NULL && tvd->vdev_isspare)
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_spare_remove(tvd);
|
|
|
|
vdev_close(vd);
|
|
|
|
vdev_free(vd);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa->spa_spares.sav_vdevs)
|
|
|
|
kmem_free(spa->spa_spares.sav_vdevs,
|
|
|
|
spa->spa_spares.sav_count * sizeof (void *));
|
|
|
|
|
|
|
|
if (spa->spa_spares.sav_config == NULL)
|
|
|
|
nspares = 0;
|
|
|
|
else
|
2021-10-27 02:15:38 +03:00
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
|
|
|
|
ZPOOL_CONFIG_SPARES, &spares, &nspares));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
spa->spa_spares.sav_count = (int)nspares;
|
|
|
|
spa->spa_spares.sav_vdevs = NULL;
|
|
|
|
|
|
|
|
if (nspares == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Construct the array of vdevs, opening them to get status in the
|
|
|
|
* process. For each spare, there is potentially two different vdev_t
|
|
|
|
* structures associated with it: one in the list of spares (used only
|
|
|
|
* for basic validation purposes) and one in the active vdev
|
|
|
|
* configuration (if it's spared in). During this phase we open and
|
|
|
|
* validate each vdev on the spare list. If the vdev also exists in the
|
|
|
|
* active configuration, then we also mark this vdev as an active spare.
|
|
|
|
*/
|
2014-01-25 03:47:46 +04:00
|
|
|
spa->spa_spares.sav_vdevs = kmem_zalloc(nspares * sizeof (void *),
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < spa->spa_spares.sav_count; i++) {
|
|
|
|
VERIFY(spa_config_parse(spa, &vd, spares[i], NULL, 0,
|
|
|
|
VDEV_ALLOC_SPARE) == 0);
|
|
|
|
ASSERT(vd != NULL);
|
|
|
|
|
|
|
|
spa->spa_spares.sav_vdevs[i] = vd;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if ((tvd = spa_lookup_by_guid(spa, vd->vdev_guid,
|
|
|
|
B_FALSE)) != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (!tvd->vdev_isspare)
|
|
|
|
spa_spare_add(tvd);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We only mark the spare active if we were successfully
|
|
|
|
* able to load the vdev. Otherwise, importing a pool
|
|
|
|
* with a bad active spare would result in strange
|
|
|
|
* behavior, because multiple pool would think the spare
|
|
|
|
* is actively in use.
|
|
|
|
*
|
|
|
|
* There is a vulnerability here to an equally bizarre
|
|
|
|
* circumstance, where a dead active spare is later
|
|
|
|
* brought back to life (onlined or otherwise). Given
|
|
|
|
* the rarity of this scenario, and the extra complexity
|
|
|
|
* it adds, we ignore the possibility.
|
|
|
|
*/
|
|
|
|
if (!vdev_is_dead(tvd))
|
|
|
|
spa_spare_activate(tvd);
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
vd->vdev_top = vd;
|
2009-07-03 02:44:48 +04:00
|
|
|
vd->vdev_aux = &spa->spa_spares;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (vdev_open(vd) != 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (vdev_validate_aux(vd) == 0)
|
|
|
|
spa_spare_add(vd);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Recompute the stashed list of spares, with status information
|
|
|
|
* this time.
|
|
|
|
*/
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_remove(spa->spa_spares.sav_config, ZPOOL_CONFIG_SPARES);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
spares = kmem_alloc(spa->spa_spares.sav_count * sizeof (void *),
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < spa->spa_spares.sav_count; i++)
|
|
|
|
spares[i] = vdev_config_generate(spa,
|
2010-05-29 00:45:14 +04:00
|
|
|
spa->spa_spares.sav_vdevs[i], B_TRUE, VDEV_CONFIG_SPARE);
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_nvlist_array(spa->spa_spares.sav_config,
|
2021-12-07 04:19:13 +03:00
|
|
|
ZPOOL_CONFIG_SPARES, (const nvlist_t * const *)spares,
|
|
|
|
spa->spa_spares.sav_count);
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < spa->spa_spares.sav_count; i++)
|
|
|
|
nvlist_free(spares[i]);
|
|
|
|
kmem_free(spares, spa->spa_spares.sav_count * sizeof (void *));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Load (or re-load) the current list of vdevs describing the active l2cache for
|
|
|
|
* this pool. When this is called, we have some form of basic information in
|
|
|
|
* 'spa_l2cache.sav_config'. We parse this into vdevs, try to open them, and
|
|
|
|
* then re-generate a more complete list including status information.
|
|
|
|
* Devices which are already active have their details maintained, and are
|
|
|
|
* not re-opened.
|
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
void
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_load_l2cache(spa_t *spa)
|
|
|
|
{
|
2017-11-19 02:21:09 +03:00
|
|
|
nvlist_t **l2cache = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint_t nl2cache;
|
|
|
|
int i, j, oldnvdevs;
|
2009-07-03 02:44:48 +04:00
|
|
|
uint64_t guid;
|
2013-02-11 10:21:05 +04:00
|
|
|
vdev_t *vd, **oldvdevs, **newvdevs;
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_aux_vdev_t *sav = &spa->spa_l2cache;
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
#ifndef _KERNEL
|
|
|
|
/*
|
|
|
|
* zdb opens both the current state of the pool and the
|
|
|
|
* checkpointed state (if present), with a different spa_t.
|
|
|
|
*
|
|
|
|
* As L2 caches are part of the ARC which is shared among open
|
|
|
|
* pools, we skip loading them when we load the checkpointed
|
|
|
|
* state of the pool.
|
|
|
|
*/
|
|
|
|
if (!spa_writeable(spa))
|
|
|
|
return;
|
|
|
|
#endif
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
oldvdevs = sav->sav_vdevs;
|
|
|
|
oldnvdevs = sav->sav_count;
|
|
|
|
sav->sav_vdevs = NULL;
|
|
|
|
sav->sav_count = 0;
|
|
|
|
|
2016-09-10 23:06:17 +03:00
|
|
|
if (sav->sav_config == NULL) {
|
|
|
|
nl2cache = 0;
|
|
|
|
newvdevs = NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(sav->sav_config,
|
|
|
|
ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache));
|
2016-09-10 23:06:17 +03:00
|
|
|
newvdevs = kmem_alloc(nl2cache * sizeof (void *), KM_SLEEP);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Process new nvlist of vdevs.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < nl2cache; i++) {
|
2021-10-27 02:15:38 +03:00
|
|
|
guid = fnvlist_lookup_uint64(l2cache[i], ZPOOL_CONFIG_GUID);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
newvdevs[i] = NULL;
|
|
|
|
for (j = 0; j < oldnvdevs; j++) {
|
|
|
|
vd = oldvdevs[j];
|
|
|
|
if (vd != NULL && guid == vd->vdev_guid) {
|
|
|
|
/*
|
|
|
|
* Retain previous vdev for add/remove ops.
|
|
|
|
*/
|
|
|
|
newvdevs[i] = vd;
|
|
|
|
oldvdevs[j] = NULL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (newvdevs[i] == NULL) {
|
|
|
|
/*
|
|
|
|
* Create new vdev
|
|
|
|
*/
|
|
|
|
VERIFY(spa_config_parse(spa, &vd, l2cache[i], NULL, 0,
|
|
|
|
VDEV_ALLOC_L2CACHE) == 0);
|
|
|
|
ASSERT(vd != NULL);
|
|
|
|
newvdevs[i] = vd;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commit this vdev as an l2cache device,
|
|
|
|
* even if it fails to open.
|
|
|
|
*/
|
|
|
|
spa_l2cache_add(vd);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
vd->vdev_top = vd;
|
|
|
|
vd->vdev_aux = sav;
|
|
|
|
|
|
|
|
spa_l2cache_activate(vd);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (vdev_open(vd) != 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
(void) vdev_validate_aux(vd);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (!vdev_is_dead(vd))
|
|
|
|
l2arc_add_vdev(spa, vd);
|
2020-06-09 20:15:08 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Upon cache device addition to a pool or pool
|
|
|
|
* creation with a cache device or if the header
|
|
|
|
* of the device is invalid we issue an async
|
|
|
|
* TRIM command for the whole device which will
|
|
|
|
* execute if l2arc_trim_ahead > 0.
|
|
|
|
*/
|
|
|
|
spa_async_request(spa, SPA_ASYNC_L2CACHE_TRIM);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-09-10 23:06:17 +03:00
|
|
|
sav->sav_vdevs = newvdevs;
|
|
|
|
sav->sav_count = (int)nl2cache;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Recompute the stashed list of l2cache devices, with status
|
|
|
|
* information this time.
|
|
|
|
*/
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_remove(sav->sav_config, ZPOOL_CONFIG_L2CACHE);
|
2016-09-10 23:06:17 +03:00
|
|
|
|
2017-11-19 02:21:09 +03:00
|
|
|
if (sav->sav_count > 0)
|
|
|
|
l2cache = kmem_alloc(sav->sav_count * sizeof (void *),
|
|
|
|
KM_SLEEP);
|
2016-09-10 23:06:17 +03:00
|
|
|
for (i = 0; i < sav->sav_count; i++)
|
|
|
|
l2cache[i] = vdev_config_generate(spa,
|
|
|
|
sav->sav_vdevs[i], B_TRUE, VDEV_CONFIG_L2CACHE);
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(sav->sav_config, ZPOOL_CONFIG_L2CACHE,
|
|
|
|
(const nvlist_t * const *)l2cache, sav->sav_count);
|
2016-09-10 23:06:17 +03:00
|
|
|
|
|
|
|
out:
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Purge vdevs that were dropped
|
|
|
|
*/
|
|
|
|
for (i = 0; i < oldnvdevs; i++) {
|
|
|
|
uint64_t pool;
|
|
|
|
|
|
|
|
vd = oldvdevs[i];
|
|
|
|
if (vd != NULL) {
|
2012-04-08 21:23:08 +04:00
|
|
|
ASSERT(vd->vdev_isl2cache);
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
if (spa_l2cache_exists(vd->vdev_guid, &pool) &&
|
|
|
|
pool != 0ULL && l2arc_vdev_present(vd))
|
2008-11-20 23:01:55 +03:00
|
|
|
l2arc_remove_vdev(vd);
|
2012-04-08 21:23:08 +04:00
|
|
|
vdev_clear_stats(vd);
|
|
|
|
vdev_free(vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (oldvdevs)
|
|
|
|
kmem_free(oldvdevs, oldnvdevs * sizeof (void *));
|
|
|
|
|
|
|
|
for (i = 0; i < sav->sav_count; i++)
|
|
|
|
nvlist_free(l2cache[i]);
|
|
|
|
if (sav->sav_count)
|
|
|
|
kmem_free(l2cache, sav->sav_count * sizeof (void *));
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
load_nvlist(spa_t *spa, uint64_t obj, nvlist_t **value)
|
|
|
|
{
|
|
|
|
dmu_buf_t *db;
|
|
|
|
char *packed = NULL;
|
|
|
|
size_t nvsize = 0;
|
|
|
|
int error;
|
|
|
|
*value = NULL;
|
|
|
|
|
2012-11-30 23:23:38 +04:00
|
|
|
error = dmu_bonus_hold(spa->spa_meta_objset, obj, FTAG, &db);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
nvsize = *(uint64_t *)db->db_data;
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
2015-02-05 23:43:37 +03:00
|
|
|
packed = vmem_alloc(nvsize, KM_SLEEP);
|
2009-07-03 02:44:48 +04:00
|
|
|
error = dmu_read(spa->spa_meta_objset, obj, 0, nvsize, packed,
|
|
|
|
DMU_READ_PREFETCH);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error == 0)
|
|
|
|
error = nvlist_unpack(packed, nvsize, value, 0);
|
2015-02-05 23:43:37 +03:00
|
|
|
vmem_free(packed, nvsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
/*
|
|
|
|
* Concrete top-level vdevs that are not missing and are not logs. At every
|
|
|
|
* spa_sync we write new uberblocks to at least SPA_SYNC_MIN_VDEVS core tvds.
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
spa_healthy_core_tvds(spa_t *spa)
|
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
uint64_t tvds = 0;
|
|
|
|
|
|
|
|
for (uint64_t i = 0; i < rvd->vdev_children; i++) {
|
|
|
|
vdev_t *vd = rvd->vdev_child[i];
|
|
|
|
if (vd->vdev_islog)
|
|
|
|
continue;
|
|
|
|
if (vdev_is_concrete(vd) && !vdev_is_dead(vd))
|
|
|
|
tvds++;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (tvds);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Checks to see if the given vdev could not be opened, in which case we post a
|
|
|
|
* sysevent to notify the autoreplace code that the device has been removed.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_check_removed(vdev_t *vd)
|
|
|
|
{
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
for (uint64_t c = 0; c < vd->vdev_children; c++)
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_check_removed(vd->vdev_child[c]);
|
|
|
|
|
2013-03-07 05:57:09 +04:00
|
|
|
if (vd->vdev_ops->vdev_op_leaf && vdev_is_dead(vd) &&
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
vdev_is_concrete(vd)) {
|
2016-07-28 01:29:15 +03:00
|
|
|
zfs_post_autoreplace(vd->vdev_spa, vd);
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(vd->vdev_spa, vd, NULL, ESC_ZFS_VDEV_CHECK);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
static int
|
|
|
|
spa_check_for_missing_logs(spa_t *spa)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2010-08-27 01:24:34 +04:00
|
|
|
* If we're doing a normal import, then build up any additional
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
* diagnostic information about missing log devices.
|
2010-08-27 01:24:34 +04:00
|
|
|
* We'll pass this up to the user for further processing.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG)) {
|
|
|
|
nvlist_t **child, *nv;
|
|
|
|
uint64_t idx = 0;
|
|
|
|
|
2016-09-22 04:09:00 +03:00
|
|
|
child = kmem_alloc(rvd->vdev_children * sizeof (nvlist_t *),
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP);
|
2021-10-27 02:15:38 +03:00
|
|
|
nv = fnvlist_alloc();
|
2010-08-27 01:24:34 +04:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
for (uint64_t c = 0; c < rvd->vdev_children; c++) {
|
2010-08-27 01:24:34 +04:00
|
|
|
vdev_t *tvd = rvd->vdev_child[c];
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
/*
|
|
|
|
* We consider a device as missing only if it failed
|
|
|
|
* to open (i.e. offline or faulted is not considered
|
|
|
|
* as missing).
|
|
|
|
*/
|
|
|
|
if (tvd->vdev_islog &&
|
|
|
|
tvd->vdev_state == VDEV_STATE_CANT_OPEN) {
|
|
|
|
child[idx++] = vdev_config_generate(spa, tvd,
|
|
|
|
B_FALSE, VDEV_CONFIG_MISSING);
|
|
|
|
}
|
2010-08-27 01:24:34 +04:00
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (idx > 0) {
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
|
|
|
|
(const nvlist_t * const *)child, idx);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
fnvlist_add_nvlist(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_MISSING_DEVICES, nv);
|
2010-08-27 01:24:34 +04:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
for (uint64_t i = 0; i < idx; i++)
|
2010-08-27 01:24:34 +04:00
|
|
|
nvlist_free(child[i]);
|
|
|
|
}
|
|
|
|
nvlist_free(nv);
|
|
|
|
kmem_free(child, rvd->vdev_children * sizeof (char **));
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (idx > 0) {
|
|
|
|
spa_load_failed(spa, "some log devices are missing");
|
2016-12-16 22:20:15 +03:00
|
|
|
vdev_dbgmsg_print_tree(rvd, 2);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
return (SET_ERROR(ENXIO));
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
for (uint64_t c = 0; c < rvd->vdev_children; c++) {
|
|
|
|
vdev_t *tvd = rvd->vdev_child[c];
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (tvd->vdev_islog &&
|
|
|
|
tvd->vdev_state == VDEV_STATE_CANT_OPEN) {
|
2010-08-27 01:24:34 +04:00
|
|
|
spa_set_log_state(spa, SPA_LOG_CLEAR);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_load_note(spa, "some log devices are "
|
|
|
|
"missing, ZIL is dropped.");
|
2016-12-16 22:20:15 +03:00
|
|
|
vdev_dbgmsg_print_tree(rvd, 2);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
break;
|
2016-04-11 23:16:57 +03:00
|
|
|
}
|
2010-08-27 01:24:34 +04:00
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2016-04-11 23:16:57 +03:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
return (0);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Check for missing log devices
|
|
|
|
*/
|
2013-09-04 16:00:57 +04:00
|
|
|
static boolean_t
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_check_logs(spa_t *spa)
|
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
boolean_t rv = B_FALSE;
|
2015-05-06 19:07:55 +03:00
|
|
|
dsl_pool_t *dp = spa_get_dsl(spa);
|
2013-09-04 16:00:57 +04:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
switch (spa->spa_log_state) {
|
2010-08-26 20:52:41 +04:00
|
|
|
default:
|
|
|
|
break;
|
2008-12-03 23:09:06 +03:00
|
|
|
case SPA_LOG_MISSING:
|
|
|
|
/* need to recheck in case slog has been restored */
|
|
|
|
case SPA_LOG_UNKNOWN:
|
2015-05-06 19:07:55 +03:00
|
|
|
rv = (dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
|
|
|
|
zil_check_log_chain, NULL, DS_FIND_CHILDREN) != 0);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (rv)
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_set_log_state(spa, SPA_LOG_MISSING);
|
2008-12-03 23:09:06 +03:00
|
|
|
break;
|
|
|
|
}
|
2013-09-04 16:00:57 +04:00
|
|
|
return (rv);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-22 02:12:54 +03:00
|
|
|
/*
|
|
|
|
* Passivate any log vdevs (note, does not apply to embedded log metaslabs).
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
static boolean_t
|
|
|
|
spa_passivate_log(spa_t *spa)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
boolean_t slog_found = B_FALSE;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(spa_config_held(spa, SCL_ALLOC, RW_WRITER));
|
2009-01-16 00:59:39 +03:00
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < rvd->vdev_children; c++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *tvd = rvd->vdev_child[c];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (tvd->vdev_islog) {
|
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-22 02:12:54 +03:00
|
|
|
ASSERT3P(tvd->vdev_log_mg, ==, NULL);
|
|
|
|
metaslab_group_passivate(tvd->vdev_mg);
|
2010-05-29 00:45:14 +04:00
|
|
|
slog_found = B_TRUE;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return (slog_found);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-22 02:12:54 +03:00
|
|
|
/*
|
|
|
|
* Activate any log vdevs (note, does not apply to embedded log metaslabs).
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
spa_activate_log(spa_t *spa)
|
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(spa_config_held(spa, SCL_ALLOC, RW_WRITER));
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < rvd->vdev_children; c++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *tvd = rvd->vdev_child[c];
|
|
|
|
|
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-22 02:12:54 +03:00
|
|
|
if (tvd->vdev_islog) {
|
|
|
|
ASSERT3P(tvd->vdev_log_mg, ==, NULL);
|
|
|
|
metaslab_group_activate(tvd->vdev_mg);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
int
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_reset_logs(spa_t *spa)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
int error;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
error = dmu_objset_find(spa_name(spa), zil_reset,
|
2013-09-04 16:00:57 +04:00
|
|
|
NULL, DS_FIND_CHILDREN);
|
|
|
|
if (error == 0) {
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* We successfully offlined the log device, sync out the
|
|
|
|
* current txg so that the "stubby" block can be removed
|
|
|
|
* by zil_sync().
|
|
|
|
*/
|
|
|
|
txg_wait_synced(spa->spa_dsl_pool, 0);
|
|
|
|
}
|
|
|
|
return (error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
spa_aux_check_removed(spa_aux_vdev_t *sav)
|
|
|
|
{
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int i = 0; i < sav->sav_count; i++)
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_check_removed(sav->sav_vdevs[i]);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
void
|
|
|
|
spa_claim_notify(zio_t *zio)
|
|
|
|
{
|
|
|
|
spa_t *spa = zio->io_spa;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zio->io_error)
|
|
|
|
return;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&spa->spa_props_lock); /* any mutex will do */
|
|
|
|
if (spa->spa_claim_max_txg < zio->io_bp->blk_birth)
|
|
|
|
spa->spa_claim_max_txg = zio->io_bp->blk_birth;
|
|
|
|
mutex_exit(&spa->spa_props_lock);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
typedef struct spa_load_error {
|
2022-02-05 00:06:38 +03:00
|
|
|
boolean_t sle_verify_data;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t sle_meta_count;
|
|
|
|
uint64_t sle_data_count;
|
|
|
|
} spa_load_error_t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
spa_load_verify_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
blkptr_t *bp = zio->io_bp;
|
|
|
|
spa_load_error_t *sle = zio->io_private;
|
|
|
|
dmu_object_type_t type = BP_GET_TYPE(bp);
|
|
|
|
int error = zio->io_error;
|
2014-07-15 22:58:41 +04:00
|
|
|
spa_t *spa = zio->io_spa;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_free(zio->io_abd);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
2012-12-14 03:24:15 +04:00
|
|
|
if ((BP_GET_LEVEL(bp) != 0 || DMU_OT_IS_METADATA(type)) &&
|
2010-05-29 00:45:14 +04:00
|
|
|
type != DMU_OT_INTENT_LOG)
|
2016-01-14 03:37:41 +03:00
|
|
|
atomic_inc_64(&sle->sle_meta_count);
|
2010-05-29 00:45:14 +04:00
|
|
|
else
|
2016-01-14 03:37:41 +03:00
|
|
|
atomic_inc_64(&sle->sle_data_count);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2014-07-15 22:58:41 +04:00
|
|
|
|
|
|
|
mutex_enter(&spa->spa_scrub_lock);
|
2019-08-13 17:11:57 +03:00
|
|
|
spa->spa_load_verify_bytes -= BP_GET_PSIZE(bp);
|
2014-07-15 22:58:41 +04:00
|
|
|
cv_broadcast(&spa->spa_scrub_io_cv);
|
|
|
|
mutex_exit(&spa->spa_scrub_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-07-15 22:58:41 +04:00
|
|
|
/*
|
2019-09-03 03:56:41 +03:00
|
|
|
* Maximum number of inflight bytes is the log2 fraction of the arc size.
|
2019-08-13 17:11:57 +03:00
|
|
|
* By default, we set it to 1/16th of the arc.
|
2014-07-15 22:58:41 +04:00
|
|
|
*/
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
static uint_t spa_load_verify_shift = 4;
|
2022-01-15 02:37:55 +03:00
|
|
|
static int spa_load_verify_metadata = B_TRUE;
|
|
|
|
static int spa_load_verify_data = B_TRUE;
|
2014-07-15 22:58:41 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
|
|
|
spa_load_verify_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
|
2014-06-25 22:37:59 +04:00
|
|
|
const zbookmark_phys_t *zb, const dnode_phys_t *dnp, void *arg)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2022-02-05 00:06:38 +03:00
|
|
|
zio_t *rio = arg;
|
|
|
|
spa_load_error_t *sle = rio->io_private;
|
|
|
|
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) zilog, (void) dnp;
|
|
|
|
|
2014-07-15 22:58:41 +04:00
|
|
|
/*
|
|
|
|
* Note: normally this routine will not be called if
|
|
|
|
* spa_load_verify_metadata is not set. However, it may be useful
|
|
|
|
* to manually set the flag after the traversal has begun.
|
|
|
|
*/
|
|
|
|
if (!spa_load_verify_metadata)
|
|
|
|
return (0);
|
2022-05-20 20:36:14 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Sanity check the block pointer in order to detect obvious damage
|
|
|
|
* before using the contents in subsequent checks or in zio_read().
|
|
|
|
* When damaged consider it to be a metadata error since we cannot
|
|
|
|
* trust the BP_GET_TYPE and BP_GET_LEVEL values.
|
|
|
|
*/
|
|
|
|
if (!zfs_blkptr_verify(spa, bp, B_FALSE, BLK_VERIFY_LOG)) {
|
|
|
|
atomic_inc_64(&sle->sle_meta_count);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (zb->zb_level == ZB_DNODE_LEVEL || BP_IS_HOLE(bp) ||
|
|
|
|
BP_IS_EMBEDDED(bp) || BP_IS_REDACTED(bp))
|
|
|
|
return (0);
|
|
|
|
|
2022-02-05 00:06:38 +03:00
|
|
|
if (!BP_IS_METADATA(bp) &&
|
|
|
|
(!spa_load_verify_data || !sle->sle_verify_data))
|
2014-07-15 22:58:41 +04:00
|
|
|
return (0);
|
|
|
|
|
2019-09-03 05:17:51 +03:00
|
|
|
uint64_t maxinflight_bytes =
|
|
|
|
arc_target_bytes() >> spa_load_verify_shift;
|
2017-11-04 23:25:13 +03:00
|
|
|
size_t size = BP_GET_PSIZE(bp);
|
2014-07-15 22:58:41 +04:00
|
|
|
|
|
|
|
mutex_enter(&spa->spa_scrub_lock);
|
2019-08-13 17:11:57 +03:00
|
|
|
while (spa->spa_load_verify_bytes >= maxinflight_bytes)
|
2014-07-15 22:58:41 +04:00
|
|
|
cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
|
2019-08-13 17:11:57 +03:00
|
|
|
spa->spa_load_verify_bytes += size;
|
2014-07-15 22:58:41 +04:00
|
|
|
mutex_exit(&spa->spa_scrub_lock);
|
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
zio_nowait(zio_read(rio, spa, bp, abd_alloc_for_io(size, B_FALSE), size,
|
2014-07-15 22:58:41 +04:00
|
|
|
spa_load_verify_done, rio->io_private, ZIO_PRIORITY_SCRUB,
|
|
|
|
ZIO_FLAG_SPECULATIVE | ZIO_FLAG_CANFAIL |
|
|
|
|
ZIO_FLAG_SCRUB | ZIO_FLAG_RAW, zb));
|
2010-05-29 00:45:14 +04:00
|
|
|
return (0);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static int
|
2016-06-16 00:51:27 +03:00
|
|
|
verify_dataset_name_len(dsl_pool_t *dp, dsl_dataset_t *ds, void *arg)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) dp, (void) arg;
|
|
|
|
|
2016-06-16 00:51:27 +03:00
|
|
|
if (dsl_dataset_namelen(ds) >= ZFS_MAX_DATASET_NAME_LEN)
|
|
|
|
return (SET_ERROR(ENAMETOOLONG));
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
|
|
|
spa_load_verify(spa_t *spa)
|
|
|
|
{
|
|
|
|
zio_t *rio;
|
|
|
|
spa_load_error_t sle = { 0 };
|
2017-02-11 01:51:09 +03:00
|
|
|
zpool_load_policy_t policy;
|
2010-05-29 00:45:14 +04:00
|
|
|
boolean_t verify_ok = B_FALSE;
|
2014-07-15 22:58:41 +04:00
|
|
|
int error = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-02-11 01:51:09 +03:00
|
|
|
zpool_get_load_policy(spa->spa_config, &policy);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-02-05 00:06:38 +03:00
|
|
|
if (policy.zlp_rewind & ZPOOL_NEVER_REWIND ||
|
|
|
|
policy.zlp_maxmeta == UINT64_MAX)
|
2010-05-29 00:45:14 +04:00
|
|
|
return (0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-16 00:51:27 +03:00
|
|
|
dsl_pool_config_enter(spa->spa_dsl_pool, FTAG);
|
|
|
|
error = dmu_objset_find_dp(spa->spa_dsl_pool,
|
|
|
|
spa->spa_dsl_pool->dp_root_dir_obj, verify_dataset_name_len, NULL,
|
|
|
|
DS_FIND_CHILDREN);
|
|
|
|
dsl_pool_config_exit(spa->spa_dsl_pool, FTAG);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
2022-02-05 00:06:38 +03:00
|
|
|
/*
|
|
|
|
* Verify data only if we are rewinding or error limit was set.
|
|
|
|
* Otherwise nothing except dbgmsg care about it to waste time.
|
|
|
|
*/
|
|
|
|
sle.sle_verify_data = (policy.zlp_rewind & ZPOOL_REWIND_MASK) ||
|
|
|
|
(policy.zlp_maxdata < UINT64_MAX);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
rio = zio_root(spa, NULL, &sle,
|
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-07-15 22:58:41 +04:00
|
|
|
if (spa_load_verify_metadata) {
|
2016-03-10 18:16:02 +03:00
|
|
|
if (spa->spa_extreme_rewind) {
|
|
|
|
spa_load_note(spa, "performing a complete scan of the "
|
|
|
|
"pool since extreme rewind is on. This may take "
|
|
|
|
"a very long time.\n (spa_load_verify_data=%u, "
|
|
|
|
"spa_load_verify_metadata=%u)",
|
|
|
|
spa_load_verify_data, spa_load_verify_metadata);
|
|
|
|
}
|
2019-08-13 17:11:57 +03:00
|
|
|
|
2014-07-15 22:58:41 +04:00
|
|
|
error = traverse_pool(spa, spa->spa_verify_min_txg,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA |
|
|
|
|
TRAVERSE_NO_DECRYPT, spa_load_verify_cb, rio);
|
2014-07-15 22:58:41 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
(void) zio_wait(rio);
|
2019-08-13 17:11:57 +03:00
|
|
|
ASSERT0(spa->spa_load_verify_bytes);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa->spa_load_meta_errors = sle.sle_meta_count;
|
|
|
|
spa->spa_load_data_errors = sle.sle_data_count;
|
|
|
|
|
2018-01-31 02:25:19 +03:00
|
|
|
if (sle.sle_meta_count != 0 || sle.sle_data_count != 0) {
|
|
|
|
spa_load_note(spa, "spa_load_verify found %llu metadata errors "
|
|
|
|
"and %llu data errors", (u_longlong_t)sle.sle_meta_count,
|
|
|
|
(u_longlong_t)sle.sle_data_count);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa_load_verify_dryrun ||
|
2017-02-11 01:51:09 +03:00
|
|
|
(!error && sle.sle_meta_count <= policy.zlp_maxmeta &&
|
|
|
|
sle.sle_data_count <= policy.zlp_maxdata)) {
|
2010-08-27 01:24:34 +04:00
|
|
|
int64_t loss = 0;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
verify_ok = B_TRUE;
|
|
|
|
spa->spa_load_txg = spa->spa_uberblock.ub_txg;
|
|
|
|
spa->spa_load_txg_ts = spa->spa_uberblock.ub_timestamp;
|
2010-08-27 01:24:34 +04:00
|
|
|
|
|
|
|
loss = spa->spa_last_ubsync_txg_ts - spa->spa_load_txg_ts;
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_uint64(spa->spa_load_info, ZPOOL_CONFIG_LOAD_TIME,
|
|
|
|
spa->spa_load_txg_ts);
|
|
|
|
fnvlist_add_int64(spa->spa_load_info, ZPOOL_CONFIG_REWIND_TIME,
|
|
|
|
loss);
|
2022-02-05 00:06:38 +03:00
|
|
|
fnvlist_add_uint64(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_LOAD_META_ERRORS, sle.sle_meta_count);
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_uint64(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_LOAD_DATA_ERRORS, sle.sle_data_count);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
|
|
|
spa->spa_load_max_txg = spa->spa_uberblock.ub_txg;
|
|
|
|
}
|
|
|
|
|
2018-01-31 02:25:19 +03:00
|
|
|
if (spa_load_verify_dryrun)
|
|
|
|
return (0);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error) {
|
|
|
|
if (error != ENXIO && error != EIO)
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EIO);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (verify_ok ? 0 : EIO);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find a value in the pool props object.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_prop_find(spa_t *spa, zpool_prop_t prop, uint64_t *val)
|
|
|
|
{
|
|
|
|
(void) zap_lookup(spa->spa_meta_objset, spa->spa_pool_props_object,
|
|
|
|
zpool_prop_to_name(prop), sizeof (uint64_t), 1, val);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find a value in the pool directory object.
|
|
|
|
*/
|
|
|
|
static int
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_dir_prop(spa_t *spa, const char *name, uint64_t *val, boolean_t log_enoent)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2016-03-10 18:16:02 +03:00
|
|
|
int error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
name, sizeof (uint64_t), 1, val);
|
|
|
|
|
|
|
|
if (error != 0 && (error != ENOENT || log_enoent)) {
|
|
|
|
spa_load_failed(spa, "couldn't get '%s' value in MOS directory "
|
|
|
|
"[error=%d]", name, error);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (error);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_vdev_err(vdev_t *vdev, vdev_aux_t aux, int err)
|
|
|
|
{
|
|
|
|
vdev_set_state(vdev, B_TRUE, VDEV_STATE_CANT_OPEN, aux);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
return (SET_ERROR(err));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
boolean_t
|
|
|
|
spa_livelist_delete_check(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_livelists_to_delete != 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static boolean_t
|
|
|
|
spa_livelist_delete_cb_check(void *arg, zthr_t *z)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) z;
|
2019-07-26 20:54:14 +03:00
|
|
|
spa_t *spa = arg;
|
|
|
|
return (spa_livelist_delete_check(spa));
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
delete_blkptr_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
spa_t *spa = arg;
|
|
|
|
zio_free(spa, tx->tx_txg, bp);
|
|
|
|
dsl_dir_diduse_space(tx->tx_pool->dp_free_dir, DD_USED_HEAD,
|
|
|
|
-bp_get_dsize_sync(spa, bp),
|
|
|
|
-BP_GET_PSIZE(bp), -BP_GET_UCSIZE(bp), tx);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
dsl_get_next_livelist_obj(objset_t *os, uint64_t zap_obj, uint64_t *llp)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t za;
|
|
|
|
zap_cursor_init(&zc, os, zap_obj);
|
|
|
|
err = zap_cursor_retrieve(&zc, &za);
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
if (err == 0)
|
|
|
|
*llp = za.za_first_integer;
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Components of livelist deletion that must be performed in syncing
|
|
|
|
* context: freeing block pointers and updating the pool-wide data
|
|
|
|
* structures to indicate how much work is left to do
|
|
|
|
*/
|
|
|
|
typedef struct sublist_delete_arg {
|
|
|
|
spa_t *spa;
|
|
|
|
dsl_deadlist_t *ll;
|
|
|
|
uint64_t key;
|
|
|
|
bplist_t *to_free;
|
|
|
|
} sublist_delete_arg_t;
|
|
|
|
|
|
|
|
static void
|
|
|
|
sublist_delete_sync(void *arg, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
sublist_delete_arg_t *sda = arg;
|
|
|
|
spa_t *spa = sda->spa;
|
|
|
|
dsl_deadlist_t *ll = sda->ll;
|
|
|
|
uint64_t key = sda->key;
|
|
|
|
bplist_t *to_free = sda->to_free;
|
|
|
|
|
|
|
|
bplist_iterate(to_free, delete_blkptr_cb, spa, tx);
|
|
|
|
dsl_deadlist_remove_entry(ll, key, tx);
|
|
|
|
}
|
|
|
|
|
|
|
|
typedef struct livelist_delete_arg {
|
|
|
|
spa_t *spa;
|
|
|
|
uint64_t ll_obj;
|
|
|
|
uint64_t zap_obj;
|
|
|
|
} livelist_delete_arg_t;
|
|
|
|
|
|
|
|
static void
|
|
|
|
livelist_delete_sync(void *arg, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
livelist_delete_arg_t *lda = arg;
|
|
|
|
spa_t *spa = lda->spa;
|
|
|
|
uint64_t ll_obj = lda->ll_obj;
|
|
|
|
uint64_t zap_obj = lda->zap_obj;
|
|
|
|
objset_t *mos = spa->spa_meta_objset;
|
|
|
|
uint64_t count;
|
|
|
|
|
|
|
|
/* free the livelist and decrement the feature count */
|
|
|
|
VERIFY0(zap_remove_int(mos, zap_obj, ll_obj, tx));
|
|
|
|
dsl_deadlist_free(mos, ll_obj, tx);
|
|
|
|
spa_feature_decr(spa, SPA_FEATURE_LIVELIST, tx);
|
|
|
|
VERIFY0(zap_count(mos, zap_obj, &count));
|
|
|
|
if (count == 0) {
|
|
|
|
/* no more livelists to delete */
|
|
|
|
VERIFY0(zap_remove(mos, DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
DMU_POOL_DELETED_CLONES, tx));
|
|
|
|
VERIFY0(zap_destroy(mos, zap_obj, tx));
|
|
|
|
spa->spa_livelists_to_delete = 0;
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
spa_notify_waiters(spa);
|
2019-07-26 20:54:14 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Load in the value for the livelist to be removed and open it. Then,
|
|
|
|
* load its first sublist and determine which block pointers should actually
|
|
|
|
* be freed. Then, call a synctask which performs the actual frees and updates
|
|
|
|
* the pool-wide livelist data.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2019-07-26 20:54:14 +03:00
|
|
|
spa_livelist_delete_cb(void *arg, zthr_t *z)
|
|
|
|
{
|
|
|
|
spa_t *spa = arg;
|
|
|
|
uint64_t ll_obj = 0, count;
|
|
|
|
objset_t *mos = spa->spa_meta_objset;
|
|
|
|
uint64_t zap_obj = spa->spa_livelists_to_delete;
|
|
|
|
/*
|
|
|
|
* Determine the next livelist to delete. This function should only
|
|
|
|
* be called if there is at least one deleted clone.
|
|
|
|
*/
|
|
|
|
VERIFY0(dsl_get_next_livelist_obj(mos, zap_obj, &ll_obj));
|
|
|
|
VERIFY0(zap_count(mos, ll_obj, &count));
|
|
|
|
if (count > 0) {
|
2020-09-03 18:38:16 +03:00
|
|
|
dsl_deadlist_t *ll;
|
2019-07-26 20:54:14 +03:00
|
|
|
dsl_deadlist_entry_t *dle;
|
|
|
|
bplist_t to_free;
|
2020-09-03 18:38:16 +03:00
|
|
|
ll = kmem_zalloc(sizeof (dsl_deadlist_t), KM_SLEEP);
|
|
|
|
dsl_deadlist_open(ll, mos, ll_obj);
|
|
|
|
dle = dsl_deadlist_first(ll);
|
2019-07-26 20:54:14 +03:00
|
|
|
ASSERT3P(dle, !=, NULL);
|
|
|
|
bplist_create(&to_free);
|
|
|
|
int err = dsl_process_sub_livelist(&dle->dle_bpobj, &to_free,
|
|
|
|
z, NULL);
|
|
|
|
if (err == 0) {
|
|
|
|
sublist_delete_arg_t sync_arg = {
|
|
|
|
.spa = spa,
|
2020-09-03 18:38:16 +03:00
|
|
|
.ll = ll,
|
2019-07-26 20:54:14 +03:00
|
|
|
.key = dle->dle_mintxg,
|
|
|
|
.to_free = &to_free
|
|
|
|
};
|
|
|
|
zfs_dbgmsg("deleting sublist (id %llu) from"
|
2021-06-23 07:53:45 +03:00
|
|
|
" livelist %llu, %lld remaining",
|
|
|
|
(u_longlong_t)dle->dle_bpobj.bpo_object,
|
|
|
|
(u_longlong_t)ll_obj, (longlong_t)count - 1);
|
2019-07-26 20:54:14 +03:00
|
|
|
VERIFY0(dsl_sync_task(spa_name(spa), NULL,
|
|
|
|
sublist_delete_sync, &sync_arg, 0,
|
|
|
|
ZFS_SPACE_CHECK_DESTROY));
|
|
|
|
} else {
|
Fix i/o error handling of livelists and zap iteration
Pool-wide metadata is stored in the MOS (Meta Object Set). This
metadata is stored in triplicate, in addition to any pool-level
reduncancy (e.g. RAIDZ). However, if all 3+ copies of this metadata are
not available, we can still get EIO/ECKSUM when reading from the MOS.
If we encounter such an error in syncing context, we have typically
already committed to making a change that we now can't do because of the
corrupt/missing metadata. We typically "handle" this with a `VERIFY()`
or `zfs_panic_recover()`. This prevents the system from continuing on
in an undefined state, while minimizing the amount of error-handling
code.
However, there are some code paths that ignore these i/o errors, or
`ASSERT()` that they don't happen. Since assertions are disabled on
non-debug builds, they effectively ignore them as well. This can lead
to ZFS continuing on in an incorrect state, potentially leading to
on-disk inconsistencies.
This commit adds handling for these i/o errors on MOS metadata,
typically with a `VERIFY()`:
* Handle error return from `zap_cursor_retrieve()` in 4 places in
`dsl_deadlist.c`.
* Handle error return from `zap_contains()` in `dsl_dir_hold_obj()`.
Turns out this call isn't necessary because we can always call
`zap_lookup()`.
* Handle error return from `zap_lookup()` in `dsl_fs_ss_limit_check()`.
* Handle error return from `zap_remove()` in `dsl_dir_rename_sync()`.
* Handle error return from `zap_lookup()` in
`dsl_dir_remove_livelist()`.
* Handle error return from `dsl_process_sub_livelist()` in
`spa_livelist_delete_cb()`.
Additionally:
* Augment the internal history log message for `zfs destroy` to note
which method is used (e.g. bptree, livelist, or, synchronous) and the
mintxg.
* Correct a comment in `dbuf_init()`.
* Correct indentation in `dsl_dir_remove_livelist()`.
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10643
2020-08-05 20:22:09 +03:00
|
|
|
VERIFY3U(err, ==, EINTR);
|
2019-07-26 20:54:14 +03:00
|
|
|
}
|
|
|
|
bplist_clear(&to_free);
|
|
|
|
bplist_destroy(&to_free);
|
2020-09-03 18:38:16 +03:00
|
|
|
dsl_deadlist_close(ll);
|
|
|
|
kmem_free(ll, sizeof (dsl_deadlist_t));
|
2019-07-26 20:54:14 +03:00
|
|
|
} else {
|
|
|
|
livelist_delete_arg_t sync_arg = {
|
|
|
|
.spa = spa,
|
|
|
|
.ll_obj = ll_obj,
|
|
|
|
.zap_obj = zap_obj
|
|
|
|
};
|
2021-06-23 07:53:45 +03:00
|
|
|
zfs_dbgmsg("deletion of livelist %llu completed",
|
|
|
|
(u_longlong_t)ll_obj);
|
2019-07-26 20:54:14 +03:00
|
|
|
VERIFY0(dsl_sync_task(spa_name(spa), NULL, livelist_delete_sync,
|
|
|
|
&sync_arg, 0, ZFS_SPACE_CHECK_DESTROY));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2019-07-26 20:54:14 +03:00
|
|
|
spa_start_livelist_destroy_thread(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT3P(spa->spa_livelist_delete_zthr, ==, NULL);
|
2020-07-29 19:43:33 +03:00
|
|
|
spa->spa_livelist_delete_zthr =
|
|
|
|
zthr_create("z_livelist_destroy",
|
2021-08-10 20:36:26 +03:00
|
|
|
spa_livelist_delete_cb_check, spa_livelist_delete_cb, spa,
|
|
|
|
minclsyspri);
|
2019-07-26 20:54:14 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
typedef struct livelist_new_arg {
|
|
|
|
bplist_t *allocs;
|
|
|
|
bplist_t *frees;
|
|
|
|
} livelist_new_arg_t;
|
|
|
|
|
|
|
|
static int
|
|
|
|
livelist_track_new_cb(void *arg, const blkptr_t *bp, boolean_t bp_freed,
|
|
|
|
dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
ASSERT(tx == NULL);
|
|
|
|
livelist_new_arg_t *lna = arg;
|
|
|
|
if (bp_freed) {
|
|
|
|
bplist_append(lna->frees, bp);
|
|
|
|
} else {
|
|
|
|
bplist_append(lna->allocs, bp);
|
|
|
|
zfs_livelist_condense_new_alloc++;
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
typedef struct livelist_condense_arg {
|
|
|
|
spa_t *spa;
|
|
|
|
bplist_t to_keep;
|
|
|
|
uint64_t first_size;
|
|
|
|
uint64_t next_size;
|
|
|
|
} livelist_condense_arg_t;
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_livelist_condense_sync(void *arg, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
livelist_condense_arg_t *lca = arg;
|
|
|
|
spa_t *spa = lca->spa;
|
|
|
|
bplist_t new_frees;
|
|
|
|
dsl_dataset_t *ds = spa->spa_to_condense.ds;
|
|
|
|
|
|
|
|
/* Have we been cancelled? */
|
|
|
|
if (spa->spa_to_condense.cancelled) {
|
|
|
|
zfs_livelist_condense_sync_cancel++;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
dsl_deadlist_entry_t *first = spa->spa_to_condense.first;
|
|
|
|
dsl_deadlist_entry_t *next = spa->spa_to_condense.next;
|
|
|
|
dsl_deadlist_t *ll = &ds->ds_dir->dd_livelist;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* It's possible that the livelist was changed while the zthr was
|
|
|
|
* running. Therefore, we need to check for new blkptrs in the two
|
|
|
|
* entries being condensed and continue to track them in the livelist.
|
|
|
|
* Because of the way we handle remapped blkptrs (see dbuf_remap_impl),
|
|
|
|
* it's possible that the newly added blkptrs are FREEs or ALLOCs so
|
|
|
|
* we need to sort them into two different bplists.
|
|
|
|
*/
|
|
|
|
uint64_t first_obj = first->dle_bpobj.bpo_object;
|
|
|
|
uint64_t next_obj = next->dle_bpobj.bpo_object;
|
|
|
|
uint64_t cur_first_size = first->dle_bpobj.bpo_phys->bpo_num_blkptrs;
|
|
|
|
uint64_t cur_next_size = next->dle_bpobj.bpo_phys->bpo_num_blkptrs;
|
|
|
|
|
|
|
|
bplist_create(&new_frees);
|
|
|
|
livelist_new_arg_t new_bps = {
|
|
|
|
.allocs = &lca->to_keep,
|
|
|
|
.frees = &new_frees,
|
|
|
|
};
|
|
|
|
|
|
|
|
if (cur_first_size > lca->first_size) {
|
|
|
|
VERIFY0(livelist_bpobj_iterate_from_nofree(&first->dle_bpobj,
|
|
|
|
livelist_track_new_cb, &new_bps, lca->first_size));
|
|
|
|
}
|
|
|
|
if (cur_next_size > lca->next_size) {
|
|
|
|
VERIFY0(livelist_bpobj_iterate_from_nofree(&next->dle_bpobj,
|
|
|
|
livelist_track_new_cb, &new_bps, lca->next_size));
|
|
|
|
}
|
|
|
|
|
|
|
|
dsl_deadlist_clear_entry(first, ll, tx);
|
|
|
|
ASSERT(bpobj_is_empty(&first->dle_bpobj));
|
|
|
|
dsl_deadlist_remove_entry(ll, next->dle_mintxg, tx);
|
|
|
|
|
|
|
|
bplist_iterate(&lca->to_keep, dsl_deadlist_insert_alloc_cb, ll, tx);
|
|
|
|
bplist_iterate(&new_frees, dsl_deadlist_insert_free_cb, ll, tx);
|
|
|
|
bplist_destroy(&new_frees);
|
|
|
|
|
|
|
|
char dsname[ZFS_MAX_DATASET_NAME_LEN];
|
|
|
|
dsl_dataset_name(ds, dsname);
|
|
|
|
zfs_dbgmsg("txg %llu condensing livelist of %s (id %llu), bpobj %llu "
|
|
|
|
"(%llu blkptrs) and bpobj %llu (%llu blkptrs) -> bpobj %llu "
|
2021-06-23 07:53:45 +03:00
|
|
|
"(%llu blkptrs)", (u_longlong_t)tx->tx_txg, dsname,
|
|
|
|
(u_longlong_t)ds->ds_object, (u_longlong_t)first_obj,
|
|
|
|
(u_longlong_t)cur_first_size, (u_longlong_t)next_obj,
|
|
|
|
(u_longlong_t)cur_next_size,
|
|
|
|
(u_longlong_t)first->dle_bpobj.bpo_object,
|
|
|
|
(u_longlong_t)first->dle_bpobj.bpo_phys->bpo_num_blkptrs);
|
2019-07-26 20:54:14 +03:00
|
|
|
out:
|
|
|
|
dmu_buf_rele(ds->ds_dbuf, spa);
|
|
|
|
spa->spa_to_condense.ds = NULL;
|
|
|
|
bplist_clear(&lca->to_keep);
|
|
|
|
bplist_destroy(&lca->to_keep);
|
|
|
|
kmem_free(lca, sizeof (livelist_condense_arg_t));
|
|
|
|
spa->spa_to_condense.syncing = B_FALSE;
|
|
|
|
}
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2019-07-26 20:54:14 +03:00
|
|
|
spa_livelist_condense_cb(void *arg, zthr_t *t)
|
|
|
|
{
|
|
|
|
while (zfs_livelist_condense_zthr_pause &&
|
|
|
|
!(zthr_has_waiters(t) || zthr_iscancelled(t)))
|
|
|
|
delay(1);
|
|
|
|
|
|
|
|
spa_t *spa = arg;
|
|
|
|
dsl_deadlist_entry_t *first = spa->spa_to_condense.first;
|
|
|
|
dsl_deadlist_entry_t *next = spa->spa_to_condense.next;
|
|
|
|
uint64_t first_size, next_size;
|
|
|
|
|
|
|
|
livelist_condense_arg_t *lca =
|
|
|
|
kmem_alloc(sizeof (livelist_condense_arg_t), KM_SLEEP);
|
|
|
|
bplist_create(&lca->to_keep);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Process the livelists (matching FREEs and ALLOCs) in open context
|
|
|
|
* so we have minimal work in syncing context to condense.
|
|
|
|
*
|
|
|
|
* We save bpobj sizes (first_size and next_size) to use later in
|
|
|
|
* syncing context to determine if entries were added to these sublists
|
|
|
|
* while in open context. This is possible because the clone is still
|
|
|
|
* active and open for normal writes and we want to make sure the new,
|
|
|
|
* unprocessed blockpointers are inserted into the livelist normally.
|
|
|
|
*
|
|
|
|
* Note that dsl_process_sub_livelist() both stores the size number of
|
|
|
|
* blockpointers and iterates over them while the bpobj's lock held, so
|
|
|
|
* the sizes returned to us are consistent which what was actually
|
|
|
|
* processed.
|
|
|
|
*/
|
|
|
|
int err = dsl_process_sub_livelist(&first->dle_bpobj, &lca->to_keep, t,
|
|
|
|
&first_size);
|
|
|
|
if (err == 0)
|
|
|
|
err = dsl_process_sub_livelist(&next->dle_bpobj, &lca->to_keep,
|
|
|
|
t, &next_size);
|
|
|
|
|
|
|
|
if (err == 0) {
|
|
|
|
while (zfs_livelist_condense_sync_pause &&
|
|
|
|
!(zthr_has_waiters(t) || zthr_iscancelled(t)))
|
|
|
|
delay(1);
|
|
|
|
|
|
|
|
dmu_tx_t *tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
|
|
|
|
dmu_tx_mark_netfree(tx);
|
|
|
|
dmu_tx_hold_space(tx, 1);
|
|
|
|
err = dmu_tx_assign(tx, TXG_NOWAIT | TXG_NOTHROTTLE);
|
|
|
|
if (err == 0) {
|
|
|
|
/*
|
|
|
|
* Prevent the condense zthr restarting before
|
|
|
|
* the synctask completes.
|
|
|
|
*/
|
|
|
|
spa->spa_to_condense.syncing = B_TRUE;
|
|
|
|
lca->spa = spa;
|
|
|
|
lca->first_size = first_size;
|
|
|
|
lca->next_size = next_size;
|
|
|
|
dsl_sync_task_nowait(spa_get_dsl(spa),
|
2020-09-04 20:29:39 +03:00
|
|
|
spa_livelist_condense_sync, lca, tx);
|
2019-07-26 20:54:14 +03:00
|
|
|
dmu_tx_commit(tx);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Condensing can not continue: either it was externally stopped or
|
|
|
|
* we were unable to assign to a tx because the pool has run out of
|
|
|
|
* space. In the second case, we'll just end up trying to condense
|
|
|
|
* again in a later txg.
|
|
|
|
*/
|
|
|
|
ASSERT(err != 0);
|
|
|
|
bplist_clear(&lca->to_keep);
|
|
|
|
bplist_destroy(&lca->to_keep);
|
|
|
|
kmem_free(lca, sizeof (livelist_condense_arg_t));
|
|
|
|
dmu_buf_rele(spa->spa_to_condense.ds->ds_dbuf, spa);
|
|
|
|
spa->spa_to_condense.ds = NULL;
|
|
|
|
if (err == EINTR)
|
|
|
|
zfs_livelist_condense_zthr_cancel++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check that there is something to condense but that a condense is not
|
|
|
|
* already in progress and that condensing has not been cancelled.
|
|
|
|
*/
|
|
|
|
static boolean_t
|
|
|
|
spa_livelist_condense_cb_check(void *arg, zthr_t *z)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) z;
|
2019-07-26 20:54:14 +03:00
|
|
|
spa_t *spa = arg;
|
|
|
|
if ((spa->spa_to_condense.ds != NULL) &&
|
|
|
|
(spa->spa_to_condense.syncing == B_FALSE) &&
|
|
|
|
(spa->spa_to_condense.cancelled == B_FALSE)) {
|
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2019-07-26 20:54:14 +03:00
|
|
|
spa_start_livelist_condensing_thread(spa_t *spa)
|
|
|
|
{
|
|
|
|
spa->spa_to_condense.ds = NULL;
|
|
|
|
spa->spa_to_condense.first = NULL;
|
|
|
|
spa->spa_to_condense.next = NULL;
|
|
|
|
spa->spa_to_condense.syncing = B_FALSE;
|
|
|
|
spa->spa_to_condense.cancelled = B_FALSE;
|
|
|
|
|
|
|
|
ASSERT3P(spa->spa_livelist_condense_zthr, ==, NULL);
|
2020-07-29 19:43:33 +03:00
|
|
|
spa->spa_livelist_condense_zthr =
|
|
|
|
zthr_create("z_livelist_condense",
|
|
|
|
spa_livelist_condense_cb_check,
|
2021-08-10 20:36:26 +03:00
|
|
|
spa_livelist_condense_cb, spa, minclsyspri);
|
2019-07-26 20:54:14 +03:00
|
|
|
}
|
|
|
|
|
OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in
spa_condense_indirect_thread(), so it calls the
spa_condense_indirect_complete_sync() sync task which sets
the spa_condensing_indirect field to NULL. Waiting for the
sync task to finish, thread A sleeps until the txg is done.
When this happens, thread A will acquire spa_async_lock and
set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is
running spa_sync() checks whether it should condense the
second vdev in vdev_indirect_should_condense() by checking the
spa_condensing_indirect field which was set to NULL by
spa_condense_indirect_thread() from thread A. So it goes on
and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned
assertions fails because thread A has not set spa_condense_thread
to NULL (which is basically the last thing it does before returning).
The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.
This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.
In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.
The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
currently condensing or not
- condensing is more decoupled from the spa_async_thread related
functionality.
As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2017-03-16 02:41:52 +03:00
|
|
|
static void
|
|
|
|
spa_spawn_aux_threads(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT(spa_writeable(spa));
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
|
|
|
|
spa_start_indirect_condensing_thread(spa);
|
2019-07-26 20:54:14 +03:00
|
|
|
spa_start_livelist_destroy_thread(spa);
|
|
|
|
spa_start_livelist_condensing_thread(spa);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
ASSERT3P(spa->spa_checkpoint_discard_zthr, ==, NULL);
|
|
|
|
spa->spa_checkpoint_discard_zthr =
|
2020-07-29 19:43:33 +03:00
|
|
|
zthr_create("z_checkpoint_discard",
|
|
|
|
spa_checkpoint_discard_thread_check,
|
2021-08-10 20:36:26 +03:00
|
|
|
spa_checkpoint_discard_thread, spa, minclsyspri);
|
OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in
spa_condense_indirect_thread(), so it calls the
spa_condense_indirect_complete_sync() sync task which sets
the spa_condensing_indirect field to NULL. Waiting for the
sync task to finish, thread A sleeps until the txg is done.
When this happens, thread A will acquire spa_async_lock and
set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is
running spa_sync() checks whether it should condense the
second vdev in vdev_indirect_should_condense() by checking the
spa_condensing_indirect field which was set to NULL by
spa_condense_indirect_thread() from thread A. So it goes on
and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned
assertions fails because thread A has not set spa_condense_thread
to NULL (which is basically the last thing it does before returning).
The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.
This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.
In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.
The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
currently condensing or not
- condensing is more decoupled from the spa_async_thread related
functionality.
As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2017-03-16 02:41:52 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Fix up config after a partly-completed split. This is done with the
|
|
|
|
* ZPOOL_CONFIG_SPLIT nvlist. Both the splitting pool and the split-off
|
|
|
|
* pool have that entry in their config, but only the splitting one contains
|
|
|
|
* a list of all the guids of the vdevs that are being split off.
|
|
|
|
*
|
|
|
|
* This function determines what to do with that list: either rejoin
|
|
|
|
* all the disks to the pool, or complete the splitting process. To attempt
|
|
|
|
* the rejoin, each disk that is offlined is marked online again, and
|
|
|
|
* we do a reopen() call. If the vdev label for every disk that was
|
|
|
|
* marked online indicates it was successfully split off (VDEV_AUX_SPLIT_POOL)
|
|
|
|
* then we call vdev_split() on each disk, and complete the split.
|
|
|
|
*
|
|
|
|
* Otherwise we leave the config alone, with all the vdevs in place in
|
|
|
|
* the original pool.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_try_repair(spa_t *spa, nvlist_t *config)
|
|
|
|
{
|
|
|
|
uint_t extracted;
|
|
|
|
uint64_t *glist;
|
|
|
|
uint_t i, gcount;
|
|
|
|
nvlist_t *nvl;
|
|
|
|
vdev_t **vd;
|
|
|
|
boolean_t attempt_reopen;
|
|
|
|
|
|
|
|
if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT, &nvl) != 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* check that the config is complete */
|
|
|
|
if (nvlist_lookup_uint64_array(nvl, ZPOOL_CONFIG_SPLIT_LIST,
|
|
|
|
&glist, &gcount) != 0)
|
|
|
|
return;
|
|
|
|
|
2014-11-21 03:09:39 +03:00
|
|
|
vd = kmem_zalloc(gcount * sizeof (vdev_t *), KM_SLEEP);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/* attempt to online all the vdevs & validate */
|
|
|
|
attempt_reopen = B_TRUE;
|
|
|
|
for (i = 0; i < gcount; i++) {
|
|
|
|
if (glist[i] == 0) /* vdev is hole */
|
|
|
|
continue;
|
|
|
|
|
|
|
|
vd[i] = spa_lookup_by_guid(spa, glist[i], B_FALSE);
|
|
|
|
if (vd[i] == NULL) {
|
|
|
|
/*
|
|
|
|
* Don't bother attempting to reopen the disks;
|
|
|
|
* just do the split.
|
|
|
|
*/
|
|
|
|
attempt_reopen = B_FALSE;
|
|
|
|
} else {
|
|
|
|
/* attempt to re-online it */
|
|
|
|
vd[i]->vdev_offline = B_FALSE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (attempt_reopen) {
|
|
|
|
vdev_reopen(spa->spa_root_vdev);
|
|
|
|
|
|
|
|
/* check each device to see what state it's in */
|
|
|
|
for (extracted = 0, i = 0; i < gcount; i++) {
|
|
|
|
if (vd[i] != NULL &&
|
|
|
|
vd[i]->vdev_stat.vs_aux != VDEV_AUX_SPLIT_POOL)
|
|
|
|
break;
|
|
|
|
++extracted;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If every disk has been moved to the new pool, or if we never
|
|
|
|
* even attempted to look at them, then we split them off for
|
|
|
|
* good.
|
|
|
|
*/
|
|
|
|
if (!attempt_reopen || gcount == extracted) {
|
|
|
|
for (i = 0; i < gcount; i++)
|
|
|
|
if (vd[i] != NULL)
|
|
|
|
vdev_split(vd[i]);
|
|
|
|
vdev_reopen(spa->spa_root_vdev);
|
|
|
|
}
|
|
|
|
|
|
|
|
kmem_free(vd, gcount * sizeof (vdev_t *));
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2022-04-19 21:38:30 +03:00
|
|
|
const char *ereport = FM_EREPORT_ZFS_POOL;
|
2010-05-29 00:45:14 +04:00
|
|
|
int error;
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa->spa_load_state = state;
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
(void) spa_import_progress_set_state(spa_guid(spa),
|
|
|
|
spa_load_state(spa));
|
2012-12-14 03:24:15 +04:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
gethrestime(&spa->spa_loaded_ts);
|
2016-12-17 01:11:29 +03:00
|
|
|
error = spa_load_impl(spa, type, &ereport);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
/*
|
|
|
|
* Don't count references from objsets that are already closed
|
|
|
|
* and are making their way through the eviction process.
|
|
|
|
*/
|
|
|
|
spa_evicting_os_wait(spa);
|
2018-10-01 20:42:05 +03:00
|
|
|
spa->spa_minref = zfs_refcount_count(&spa->spa_refcount);
|
2010-08-27 01:24:34 +04:00
|
|
|
if (error) {
|
|
|
|
if (error != EEXIST) {
|
|
|
|
spa->spa_loaded_ts.tv_sec = 0;
|
|
|
|
spa->spa_loaded_ts.tv_nsec = 0;
|
|
|
|
}
|
|
|
|
if (error != EBADF) {
|
2020-09-01 05:35:11 +03:00
|
|
|
(void) zfs_ereport_post(ereport, spa,
|
2020-09-04 20:34:28 +03:00
|
|
|
NULL, NULL, NULL, 0);
|
2010-08-27 01:24:34 +04:00
|
|
|
}
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
spa->spa_load_state = error ? SPA_LOAD_ERROR : SPA_LOAD_NONE;
|
|
|
|
spa->spa_ena = 0;
|
|
|
|
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
(void) spa_import_progress_set_state(spa_guid(spa),
|
|
|
|
spa_load_state(spa));
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2016-05-07 04:14:03 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2016-04-11 23:16:57 +03:00
|
|
|
/*
|
|
|
|
* Count the number of per-vdev ZAPs associated with all of the vdevs in the
|
|
|
|
* vdev tree rooted in the given vd, and ensure that each ZAP is present in the
|
|
|
|
* spa's per-vdev ZAP list.
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
vdev_count_verify_zaps(vdev_t *vd)
|
|
|
|
{
|
|
|
|
spa_t *spa = vd->vdev_spa;
|
|
|
|
uint64_t total = 0;
|
|
|
|
|
|
|
|
if (vd->vdev_top_zap != 0) {
|
|
|
|
total++;
|
|
|
|
ASSERT0(zap_lookup_int(spa->spa_meta_objset,
|
|
|
|
spa->spa_all_vdev_zaps, vd->vdev_top_zap));
|
|
|
|
}
|
|
|
|
if (vd->vdev_leaf_zap != 0) {
|
|
|
|
total++;
|
|
|
|
ASSERT0(zap_lookup_int(spa->spa_meta_objset,
|
|
|
|
spa->spa_all_vdev_zaps, vd->vdev_leaf_zap));
|
|
|
|
}
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (uint64_t i = 0; i < vd->vdev_children; i++) {
|
2016-04-11 23:16:57 +03:00
|
|
|
total += vdev_count_verify_zaps(vd->vdev_child[i]);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (total);
|
|
|
|
}
|
2021-12-21 20:44:20 +03:00
|
|
|
#else
|
|
|
|
#define vdev_count_verify_zaps(vd) ((void) sizeof (vd), 0)
|
2016-05-07 04:14:03 +03:00
|
|
|
#endif
|
2016-04-11 23:16:57 +03:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
/*
|
|
|
|
* Determine whether the activity check is required.
|
|
|
|
*/
|
|
|
|
static boolean_t
|
2017-12-18 21:28:27 +03:00
|
|
|
spa_activity_check_required(spa_t *spa, uberblock_t *ub, nvlist_t *label,
|
|
|
|
nvlist_t *config)
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
{
|
|
|
|
uint64_t state = 0;
|
|
|
|
uint64_t hostid = 0;
|
|
|
|
uint64_t tryconfig_txg = 0;
|
|
|
|
uint64_t tryconfig_timestamp = 0;
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
uint16_t tryconfig_mmp_seq = 0;
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
nvlist_t *nvinfo;
|
|
|
|
|
|
|
|
if (nvlist_exists(config, ZPOOL_CONFIG_LOAD_INFO)) {
|
|
|
|
nvinfo = fnvlist_lookup_nvlist(config, ZPOOL_CONFIG_LOAD_INFO);
|
|
|
|
(void) nvlist_lookup_uint64(nvinfo, ZPOOL_CONFIG_MMP_TXG,
|
|
|
|
&tryconfig_txg);
|
|
|
|
(void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_TIMESTAMP,
|
|
|
|
&tryconfig_timestamp);
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
(void) nvlist_lookup_uint16(nvinfo, ZPOOL_CONFIG_MMP_SEQ,
|
|
|
|
&tryconfig_mmp_seq);
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
(void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_STATE, &state);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Disable the MMP activity check - This is used by zdb which
|
|
|
|
* is intended to be used on potentially active pools.
|
|
|
|
*/
|
|
|
|
if (spa->spa_import_flags & ZFS_IMPORT_SKIP_MMP)
|
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Skip the activity check when the MMP feature is disabled.
|
|
|
|
*/
|
|
|
|
if (ub->ub_mmp_magic == MMP_MAGIC && ub->ub_mmp_delay == 0)
|
|
|
|
return (B_FALSE);
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
/*
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
* If the tryconfig_ values are nonzero, they are the results of an
|
|
|
|
* earlier tryimport. If they all match the uberblock we just found,
|
|
|
|
* then the pool has not changed and we return false so we do not test
|
|
|
|
* a second time.
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
*/
|
|
|
|
if (tryconfig_txg && tryconfig_txg == ub->ub_txg &&
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
tryconfig_timestamp && tryconfig_timestamp == ub->ub_timestamp &&
|
|
|
|
tryconfig_mmp_seq && tryconfig_mmp_seq ==
|
|
|
|
(MMP_SEQ_VALID(ub) ? MMP_SEQ(ub) : 0))
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allow the activity check to be skipped when importing the pool
|
2017-12-18 21:28:27 +03:00
|
|
|
* on the same host which last imported it. Since the hostid from
|
|
|
|
* configuration may be stale use the one read from the label.
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
*/
|
2017-12-18 21:28:27 +03:00
|
|
|
if (nvlist_exists(label, ZPOOL_CONFIG_HOSTID))
|
|
|
|
hostid = fnvlist_lookup_uint64(label, ZPOOL_CONFIG_HOSTID);
|
|
|
|
|
2019-09-10 23:42:30 +03:00
|
|
|
if (hostid == spa_get_hostid(spa))
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Skip the activity test when the pool was cleanly exported.
|
|
|
|
*/
|
|
|
|
if (state != POOL_STATE_ACTIVE)
|
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
/*
|
|
|
|
* Nanoseconds the activity check must watch for changes on-disk.
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
spa_activity_check_duration(spa_t *spa, uberblock_t *ub)
|
|
|
|
{
|
|
|
|
uint64_t import_intervals = MAX(zfs_multihost_import_intervals, 1);
|
|
|
|
uint64_t multihost_interval = MSEC2NSEC(
|
|
|
|
MMP_INTERVAL_OK(zfs_multihost_interval));
|
|
|
|
uint64_t import_delay = MAX(NANOSEC, import_intervals *
|
|
|
|
multihost_interval);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Local tunables determine a minimum duration except for the case
|
|
|
|
* where we know when the remote host will suspend the pool if MMP
|
|
|
|
* writes do not land.
|
|
|
|
*
|
|
|
|
* See Big Theory comment at the top of mmp.c for the reasoning behind
|
|
|
|
* these cases and times.
|
|
|
|
*/
|
|
|
|
|
|
|
|
ASSERT(MMP_IMPORT_SAFETY_FACTOR >= 100);
|
|
|
|
|
|
|
|
if (MMP_INTERVAL_VALID(ub) && MMP_FAIL_INT_VALID(ub) &&
|
|
|
|
MMP_FAIL_INT(ub) > 0) {
|
|
|
|
|
|
|
|
/* MMP on remote host will suspend pool after failed writes */
|
|
|
|
import_delay = MMP_FAIL_INT(ub) * MSEC2NSEC(MMP_INTERVAL(ub)) *
|
|
|
|
MMP_IMPORT_SAFETY_FACTOR / 100;
|
|
|
|
|
|
|
|
zfs_dbgmsg("fail_intvals>0 import_delay=%llu ub_mmp "
|
|
|
|
"mmp_fails=%llu ub_mmp mmp_interval=%llu "
|
2021-06-23 07:53:45 +03:00
|
|
|
"import_intervals=%llu", (u_longlong_t)import_delay,
|
|
|
|
(u_longlong_t)MMP_FAIL_INT(ub),
|
|
|
|
(u_longlong_t)MMP_INTERVAL(ub),
|
|
|
|
(u_longlong_t)import_intervals);
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
|
|
|
|
} else if (MMP_INTERVAL_VALID(ub) && MMP_FAIL_INT_VALID(ub) &&
|
|
|
|
MMP_FAIL_INT(ub) == 0) {
|
|
|
|
|
|
|
|
/* MMP on remote host will never suspend pool */
|
|
|
|
import_delay = MAX(import_delay, (MSEC2NSEC(MMP_INTERVAL(ub)) +
|
|
|
|
ub->ub_mmp_delay) * import_intervals);
|
|
|
|
|
|
|
|
zfs_dbgmsg("fail_intvals=0 import_delay=%llu ub_mmp "
|
|
|
|
"mmp_interval=%llu ub_mmp_delay=%llu "
|
2021-06-23 07:53:45 +03:00
|
|
|
"import_intervals=%llu", (u_longlong_t)import_delay,
|
|
|
|
(u_longlong_t)MMP_INTERVAL(ub),
|
|
|
|
(u_longlong_t)ub->ub_mmp_delay,
|
|
|
|
(u_longlong_t)import_intervals);
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
|
|
|
|
} else if (MMP_VALID(ub)) {
|
|
|
|
/*
|
2019-09-03 03:56:41 +03:00
|
|
|
* zfs-0.7 compatibility case
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
*/
|
|
|
|
|
|
|
|
import_delay = MAX(import_delay, (multihost_interval +
|
|
|
|
ub->ub_mmp_delay) * import_intervals);
|
|
|
|
|
|
|
|
zfs_dbgmsg("import_delay=%llu ub_mmp_delay=%llu "
|
2021-06-23 07:53:45 +03:00
|
|
|
"import_intervals=%llu leaves=%u",
|
|
|
|
(u_longlong_t)import_delay,
|
|
|
|
(u_longlong_t)ub->ub_mmp_delay,
|
|
|
|
(u_longlong_t)import_intervals,
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
vdev_count_leaves(spa));
|
|
|
|
} else {
|
|
|
|
/* Using local tunings is the only reasonable option */
|
|
|
|
zfs_dbgmsg("pool last imported on non-MMP aware "
|
|
|
|
"host using import_delay=%llu multihost_interval=%llu "
|
2021-06-23 07:53:45 +03:00
|
|
|
"import_intervals=%llu", (u_longlong_t)import_delay,
|
|
|
|
(u_longlong_t)multihost_interval,
|
|
|
|
(u_longlong_t)import_intervals);
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
return (import_delay);
|
|
|
|
}
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
/*
|
|
|
|
* Perform the import activity check. If the user canceled the import or
|
|
|
|
* we detected activity then fail.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
spa_activity_check(spa_t *spa, uberblock_t *ub, nvlist_t *config)
|
|
|
|
{
|
|
|
|
uint64_t txg = ub->ub_txg;
|
|
|
|
uint64_t timestamp = ub->ub_timestamp;
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
uint64_t mmp_config = ub->ub_mmp_config;
|
|
|
|
uint16_t mmp_seq = MMP_SEQ_VALID(ub) ? MMP_SEQ(ub) : 0;
|
|
|
|
uint64_t import_delay;
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
hrtime_t import_expire;
|
|
|
|
nvlist_t *mmp_label = NULL;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
kcondvar_t cv;
|
|
|
|
kmutex_t mtx;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
cv_init(&cv, NULL, CV_DEFAULT, NULL);
|
|
|
|
mutex_init(&mtx, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_enter(&mtx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If ZPOOL_CONFIG_MMP_TXG is present an activity check was performed
|
|
|
|
* during the earlier tryimport. If the txg recorded there is 0 then
|
|
|
|
* the pool is known to be active on another host.
|
|
|
|
*
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
* Otherwise, the pool might be in use on another host. Check for
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
* changes in the uberblocks on disk if necessary.
|
|
|
|
*/
|
|
|
|
if (nvlist_exists(config, ZPOOL_CONFIG_LOAD_INFO)) {
|
|
|
|
nvlist_t *nvinfo = fnvlist_lookup_nvlist(config,
|
|
|
|
ZPOOL_CONFIG_LOAD_INFO);
|
|
|
|
|
|
|
|
if (nvlist_exists(nvinfo, ZPOOL_CONFIG_MMP_TXG) &&
|
|
|
|
fnvlist_lookup_uint64(nvinfo, ZPOOL_CONFIG_MMP_TXG) == 0) {
|
|
|
|
vdev_uberblock_load(rvd, ub, &mmp_label);
|
|
|
|
error = SET_ERROR(EREMOTEIO);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
import_delay = spa_activity_check_duration(spa, ub);
|
Update mmp_delay on sync or skipped, failed write
When an MMP write is skipped, or fails, and time since
mts->mmp_last_write is already greater than mts->mmp_delay, increase
mts->mmp_delay. The original code only updated mts->mmp_delay when a
write succeeded, but this results in the write(s) after delays and
failed write(s) reporting an ub_mmp_delay which is too low.
Update mmp_last_write and mmp_delay if a txg sync was successful. At
least one uberblock was written, thus extending the time we can be sure
the pool will not be imported by another host.
Do not allow mmp_delay to go below (MSEC2NSEC(zfs_multihost_interval) /
vdev_count_leaves()) so that a period of frequent successful MMP writes,
e.g. due to frequent txg syncs, does not result in an import activity
check so short it is not reliable based on mmp thread writes alone.
Remove unnecessary local variable, start. We do not use the start time
of the loop iteration.
Add a debug message in spa_activity_check() to allow verification of the
import_delay value and to prove the activity check occurred.
Alter the tests that import pools and attempt to detect an activity
check. Calculate the expected duration of spa_activity_check() based on
module parameters at the time the import is performed, rather than a
fixed time set in mmp.cfg. The fixed time may be wrong. Also, use the
default zfs_multihost_interval value so the activity check is longer and
easier to recognize.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7330
2018-04-05 02:38:44 +03:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
/* Add a small random factor in case of simultaneous imports (0-25%) */
|
2021-06-23 02:35:23 +03:00
|
|
|
import_delay += import_delay * random_in_range(250) / 1000;
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
|
|
|
|
import_expire = gethrtime() + import_delay;
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
|
|
|
|
while (gethrtime() < import_expire) {
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
(void) spa_import_progress_set_mmp_check(spa_guid(spa),
|
|
|
|
NSEC2SEC(import_expire - gethrtime()));
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
vdev_uberblock_load(rvd, ub, &mmp_label);
|
|
|
|
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
if (txg != ub->ub_txg || timestamp != ub->ub_timestamp ||
|
|
|
|
mmp_seq != (MMP_SEQ_VALID(ub) ? MMP_SEQ(ub) : 0)) {
|
|
|
|
zfs_dbgmsg("multihost activity detected "
|
|
|
|
"txg %llu ub_txg %llu "
|
|
|
|
"timestamp %llu ub_timestamp %llu "
|
|
|
|
"mmp_config %#llx ub_mmp_config %#llx",
|
2021-06-23 07:53:45 +03:00
|
|
|
(u_longlong_t)txg, (u_longlong_t)ub->ub_txg,
|
|
|
|
(u_longlong_t)timestamp,
|
|
|
|
(u_longlong_t)ub->ub_timestamp,
|
|
|
|
(u_longlong_t)mmp_config,
|
|
|
|
(u_longlong_t)ub->ub_mmp_config);
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
error = SET_ERROR(EREMOTEIO);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mmp_label) {
|
|
|
|
nvlist_free(mmp_label);
|
|
|
|
mmp_label = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
error = cv_timedwait_sig(&cv, &mtx, ddi_get_lbolt() + hz);
|
|
|
|
if (error != -1) {
|
|
|
|
error = SET_ERROR(EINTR);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
error = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
mutex_exit(&mtx);
|
|
|
|
mutex_destroy(&mtx);
|
|
|
|
cv_destroy(&cv);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the pool is determined to be active store the status in the
|
|
|
|
* spa->spa_load_info nvlist. If the remote hostname or hostid are
|
|
|
|
* available from configuration read from disk store them as well.
|
|
|
|
* This allows 'zpool import' to generate a more useful message.
|
|
|
|
*
|
|
|
|
* ZPOOL_CONFIG_MMP_STATE - observed pool status (mandatory)
|
|
|
|
* ZPOOL_CONFIG_MMP_HOSTNAME - hostname from the active pool
|
|
|
|
* ZPOOL_CONFIG_MMP_HOSTID - hostid from the active pool
|
|
|
|
*/
|
|
|
|
if (error == EREMOTEIO) {
|
2022-04-19 21:38:30 +03:00
|
|
|
const char *hostname = "<unknown>";
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
uint64_t hostid = 0;
|
|
|
|
|
|
|
|
if (mmp_label) {
|
|
|
|
if (nvlist_exists(mmp_label, ZPOOL_CONFIG_HOSTNAME)) {
|
|
|
|
hostname = fnvlist_lookup_string(mmp_label,
|
|
|
|
ZPOOL_CONFIG_HOSTNAME);
|
|
|
|
fnvlist_add_string(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_MMP_HOSTNAME, hostname);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (nvlist_exists(mmp_label, ZPOOL_CONFIG_HOSTID)) {
|
|
|
|
hostid = fnvlist_lookup_uint64(mmp_label,
|
|
|
|
ZPOOL_CONFIG_HOSTID);
|
|
|
|
fnvlist_add_uint64(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_MMP_HOSTID, hostid);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
fnvlist_add_uint64(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_MMP_STATE, MMP_STATE_ACTIVE);
|
|
|
|
fnvlist_add_uint64(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_MMP_TXG, 0);
|
|
|
|
|
|
|
|
error = spa_vdev_err(rvd, VDEV_AUX_ACTIVE, EREMOTEIO);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mmp_label)
|
|
|
|
nvlist_free(mmp_label);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
static int
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_verify_host(spa_t *spa, nvlist_t *mos_config)
|
|
|
|
{
|
|
|
|
uint64_t hostid;
|
|
|
|
char *hostname;
|
|
|
|
uint64_t myhostid = 0;
|
|
|
|
|
|
|
|
if (!spa_is_root(spa) && nvlist_lookup_uint64(mos_config,
|
|
|
|
ZPOOL_CONFIG_HOSTID, &hostid) == 0) {
|
|
|
|
hostname = fnvlist_lookup_string(mos_config,
|
|
|
|
ZPOOL_CONFIG_HOSTNAME);
|
|
|
|
|
|
|
|
myhostid = zone_get_hostid(NULL);
|
|
|
|
|
|
|
|
if (hostid != 0 && myhostid != 0 && hostid != myhostid) {
|
|
|
|
cmn_err(CE_WARN, "pool '%s' could not be "
|
|
|
|
"loaded as it was last accessed by "
|
|
|
|
"another system (host: %s hostid: 0x%llx). "
|
2020-08-27 07:43:06 +03:00
|
|
|
"See: https://openzfs.github.io/openzfs-docs/msg/"
|
|
|
|
"ZFS-8000-EY",
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_name(spa), hostname, (u_longlong_t)hostid);
|
|
|
|
spa_load_failed(spa, "hostid verification failed: pool "
|
|
|
|
"last accessed by host: %s (hostid: 0x%llx)",
|
|
|
|
hostname, (u_longlong_t)hostid);
|
|
|
|
return (SET_ERROR(EBADF));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_ld_parse_config(spa_t *spa, spa_import_type_t type)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
int error = 0;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
nvlist_t *nvtree, *nvl, *config = spa->spa_config;
|
2017-11-04 23:25:13 +03:00
|
|
|
int parse;
|
2016-02-23 19:49:30 +03:00
|
|
|
vdev_t *rvd;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
uint64_t pool_guid;
|
|
|
|
char *comment;
|
2021-02-18 08:30:45 +03:00
|
|
|
char *compatibility;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Versioning wasn't explicitly added to the label until later, so if
|
|
|
|
* it's not present treat it as the initial version.
|
|
|
|
*/
|
|
|
|
if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
|
|
|
|
&spa->spa_ubsync.ub_version) != 0)
|
|
|
|
spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
|
|
|
|
|
|
|
|
if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid)) {
|
|
|
|
spa_load_failed(spa, "invalid config provided: '%s' missing",
|
|
|
|
ZPOOL_CONFIG_POOL_GUID);
|
|
|
|
return (SET_ERROR(EINVAL));
|
|
|
|
}
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
/*
|
|
|
|
* If we are doing an import, ensure that the pool is not already
|
|
|
|
* imported by checking if its pool guid already exists in the
|
|
|
|
* spa namespace.
|
|
|
|
*
|
|
|
|
* The only case that we allow an already imported pool to be
|
|
|
|
* imported again, is when the pool is checkpointed and we want to
|
|
|
|
* look at its checkpointed state from userland tools like zdb.
|
|
|
|
*/
|
|
|
|
#ifdef _KERNEL
|
|
|
|
if ((spa->spa_load_state == SPA_LOAD_IMPORT ||
|
|
|
|
spa->spa_load_state == SPA_LOAD_TRYIMPORT) &&
|
|
|
|
spa_guid_exists(pool_guid, 0)) {
|
|
|
|
#else
|
|
|
|
if ((spa->spa_load_state == SPA_LOAD_IMPORT ||
|
|
|
|
spa->spa_load_state == SPA_LOAD_TRYIMPORT) &&
|
|
|
|
spa_guid_exists(pool_guid, 0) &&
|
|
|
|
!spa_importing_readonly_checkpoint(spa)) {
|
|
|
|
#endif
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_load_failed(spa, "a pool with guid %llu is already open",
|
|
|
|
(u_longlong_t)pool_guid);
|
|
|
|
return (SET_ERROR(EEXIST));
|
|
|
|
}
|
|
|
|
|
|
|
|
spa->spa_config_guid = pool_guid;
|
|
|
|
|
|
|
|
nvlist_free(spa->spa_load_info);
|
|
|
|
spa->spa_load_info = fnvlist_alloc();
|
|
|
|
|
|
|
|
ASSERT(spa->spa_comment == NULL);
|
|
|
|
if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0)
|
|
|
|
spa->spa_comment = spa_strdup(comment);
|
|
|
|
|
2021-02-18 08:30:45 +03:00
|
|
|
ASSERT(spa->spa_compatibility == NULL);
|
|
|
|
if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMPATIBILITY,
|
|
|
|
&compatibility) == 0)
|
|
|
|
spa->spa_compatibility = spa_strdup(compatibility);
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
(void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG,
|
|
|
|
&spa->spa_config_txg);
|
|
|
|
|
|
|
|
if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT, &nvl) == 0)
|
|
|
|
spa->spa_config_splitting = fnvlist_dup(nvl);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvtree)) {
|
|
|
|
spa_load_failed(spa, "invalid config provided: '%s' missing",
|
|
|
|
ZPOOL_CONFIG_VDEV_TREE);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINVAL));
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Create "The Godfather" zio to hold all async IOs
|
|
|
|
*/
|
2014-09-17 10:59:43 +04:00
|
|
|
spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
|
|
|
|
KM_SLEEP);
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int i = 0; i < max_ncpus; i++) {
|
2014-09-17 10:59:43 +04:00
|
|
|
spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
|
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
|
|
|
|
ZIO_FLAG_GODFATHER);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Parse the configuration into a vdev tree. We explicitly set the
|
|
|
|
* value that will be returned by spa_version() since parsing the
|
|
|
|
* configuration requires knowing the version number.
|
|
|
|
*/
|
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
parse = (type == SPA_IMPORT_EXISTING ?
|
|
|
|
VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT);
|
2016-02-23 19:49:30 +03:00
|
|
|
error = spa_config_parse(spa, &rvd, nvtree, NULL, 0, parse);
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
if (error != 0) {
|
|
|
|
spa_load_failed(spa, "unable to parse config [error=%d]",
|
|
|
|
error);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (error);
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
ASSERT(spa->spa_root_vdev == rvd);
|
2015-05-20 07:14:01 +03:00
|
|
|
ASSERT3U(spa->spa_min_ashift, >=, SPA_MINBLOCKSHIFT);
|
|
|
|
ASSERT3U(spa->spa_max_ashift, <=, SPA_MAXBLOCKSHIFT);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (type != SPA_IMPORT_ASSEMBLE) {
|
|
|
|
ASSERT(spa_guid(spa) == pool_guid);
|
|
|
|
}
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
/*
|
|
|
|
* Recursively open all vdevs in the vdev tree. This function is called twice:
|
|
|
|
* first with the untrusted config, then with the trusted config.
|
|
|
|
*/
|
2016-02-23 19:49:30 +03:00
|
|
|
static int
|
|
|
|
spa_ld_open_vdevs(spa_t *spa)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
/*
|
|
|
|
* spa_missing_tvds_allowed defines how many top-level vdevs can be
|
|
|
|
* missing/unopenable for the root vdev to be still considered openable.
|
|
|
|
*/
|
|
|
|
if (spa->spa_trust_config) {
|
|
|
|
spa->spa_missing_tvds_allowed = zfs_max_missing_tvds;
|
|
|
|
} else if (spa->spa_config_source == SPA_CONFIG_SRC_CACHEFILE) {
|
|
|
|
spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_cachefile;
|
|
|
|
} else if (spa->spa_config_source == SPA_CONFIG_SRC_SCAN) {
|
|
|
|
spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_scan;
|
|
|
|
} else {
|
|
|
|
spa->spa_missing_tvds_allowed = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
spa->spa_missing_tvds_allowed =
|
|
|
|
MAX(zfs_max_missing_tvds, spa->spa_missing_tvds_allowed);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2016-02-23 19:49:30 +03:00
|
|
|
error = vdev_open(spa->spa_root_vdev);
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
|
|
|
|
if (spa->spa_missing_tvds != 0) {
|
|
|
|
spa_load_note(spa, "vdev tree has %lld missing top-level "
|
|
|
|
"vdevs.", (u_longlong_t)spa->spa_missing_tvds);
|
2019-11-21 20:32:57 +03:00
|
|
|
if (spa->spa_trust_config && (spa->spa_mode & SPA_MODE_WRITE)) {
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
/*
|
|
|
|
* Although theoretically we could allow users to open
|
|
|
|
* incomplete pools in RW mode, we'd need to add a lot
|
|
|
|
* of extra logic (e.g. adjust pool space to account
|
|
|
|
* for missing vdevs).
|
|
|
|
* This limitation also prevents users from accidentally
|
|
|
|
* opening the pool in RW mode during data recovery and
|
|
|
|
* damaging it further.
|
|
|
|
*/
|
|
|
|
spa_load_note(spa, "pools with missing top-level "
|
|
|
|
"vdevs can only be opened in read-only mode.");
|
|
|
|
error = SET_ERROR(ENXIO);
|
|
|
|
} else {
|
|
|
|
spa_load_note(spa, "current settings allow for maximum "
|
|
|
|
"%lld missing top-level vdevs at this stage.",
|
|
|
|
(u_longlong_t)spa->spa_missing_tvds_allowed);
|
|
|
|
}
|
|
|
|
}
|
2016-03-10 18:16:02 +03:00
|
|
|
if (error != 0) {
|
|
|
|
spa_load_failed(spa, "unable to open vdev tree [error=%d]",
|
|
|
|
error);
|
|
|
|
}
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (spa->spa_missing_tvds != 0 || error != 0)
|
|
|
|
vdev_dbgmsg_print_tree(spa->spa_root_vdev, 2);
|
2016-02-23 19:49:30 +03:00
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
/*
|
|
|
|
* We need to validate the vdev labels against the configuration that
|
|
|
|
* we have in hand. This function is called twice: first with an untrusted
|
|
|
|
* config, then with a trusted config. The validation is more strict when the
|
|
|
|
* config is trusted.
|
|
|
|
*/
|
2016-02-23 19:49:30 +03:00
|
|
|
static int
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_ld_validate_vdevs(spa_t *spa)
|
2016-02-23 19:49:30 +03:00
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
|
|
|
error = vdev_validate(rvd);
|
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (error != 0) {
|
|
|
|
spa_load_failed(spa, "vdev_validate failed [error=%d]", error);
|
|
|
|
return (error);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN) {
|
|
|
|
spa_load_failed(spa, "cannot open vdev tree after invalidating "
|
|
|
|
"some vdevs");
|
|
|
|
vdev_dbgmsg_print_tree(rvd, 2);
|
|
|
|
return (SET_ERROR(ENXIO));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
static void
|
|
|
|
spa_ld_select_uberblock_done(spa_t *spa, uberblock_t *ub)
|
|
|
|
{
|
|
|
|
spa->spa_state = POOL_STATE_ACTIVE;
|
|
|
|
spa->spa_ubsync = spa->spa_uberblock;
|
|
|
|
spa->spa_verify_min_txg = spa->spa_extreme_rewind ?
|
|
|
|
TXG_INITIAL - 1 : spa_last_synced_txg(spa) - TXG_DEFER_SIZE - 1;
|
|
|
|
spa->spa_first_txg = spa->spa_last_ubsync_txg ?
|
|
|
|
spa->spa_last_ubsync_txg : spa_last_synced_txg(spa) + 1;
|
|
|
|
spa->spa_claim_max_txg = spa->spa_first_txg;
|
|
|
|
spa->spa_prev_software_version = ub->ub_software_version;
|
|
|
|
}
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
static int
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_ld_select_uberblock(spa_t *spa, spa_import_type_t type)
|
2016-02-23 19:49:30 +03:00
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
nvlist_t *label;
|
|
|
|
uberblock_t *ub = &spa->spa_uberblock;
|
|
|
|
boolean_t activity_check = B_FALSE;
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
/*
|
|
|
|
* If we are opening the checkpointed state of the pool by
|
|
|
|
* rewinding to it, at this point we will have written the
|
|
|
|
* checkpointed uberblock to the vdev labels, so searching
|
|
|
|
* the labels will find the right uberblock. However, if
|
|
|
|
* we are opening the checkpointed state read-only, we have
|
|
|
|
* not modified the labels. Therefore, we must ignore the
|
|
|
|
* labels and continue using the spa_uberblock that was set
|
|
|
|
* by spa_ld_checkpoint_rewind.
|
|
|
|
*
|
|
|
|
* Note that it would be fine to ignore the labels when
|
|
|
|
* rewinding (opening writeable) as well. However, if we
|
|
|
|
* crash just after writing the labels, we will end up
|
|
|
|
* searching the labels. Doing so in the common case means
|
|
|
|
* that this code path gets exercised normally, rather than
|
|
|
|
* just in the edge case.
|
|
|
|
*/
|
|
|
|
if (ub->ub_checkpoint_txg != 0 &&
|
|
|
|
spa_importing_readonly_checkpoint(spa)) {
|
|
|
|
spa_ld_select_uberblock_done(spa, ub);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Find the best uberblock.
|
|
|
|
*/
|
2012-12-14 03:24:15 +04:00
|
|
|
vdev_uberblock_load(rvd, ub, &label);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we weren't able to find a single valid uberblock, return failure.
|
|
|
|
*/
|
2012-12-14 03:24:15 +04:00
|
|
|
if (ub->ub_txg == 0) {
|
|
|
|
nvlist_free(label);
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_failed(spa, "no valid uberblock found");
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, ENXIO));
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
if (spa->spa_load_max_txg != UINT64_MAX) {
|
|
|
|
(void) spa_import_progress_set_max_txg(spa_guid(spa),
|
|
|
|
(u_longlong_t)spa->spa_load_max_txg);
|
|
|
|
}
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_note(spa, "using uberblock with txg=%llu",
|
|
|
|
(u_longlong_t)ub->ub_txg);
|
|
|
|
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
/*
|
|
|
|
* For pools which have the multihost property on determine if the
|
|
|
|
* pool is truly inactive and can be safely imported. Prevent
|
|
|
|
* hosts which don't have a hostid set from importing the pool.
|
|
|
|
*/
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
activity_check = spa_activity_check_required(spa, ub, label,
|
|
|
|
spa->spa_config);
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (activity_check) {
|
|
|
|
if (ub->ub_mmp_magic == MMP_MAGIC && ub->ub_mmp_delay &&
|
2019-09-10 23:42:30 +03:00
|
|
|
spa_get_hostid(spa) == 0) {
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
nvlist_free(label);
|
|
|
|
fnvlist_add_uint64(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_MMP_STATE, MMP_STATE_NO_HOSTID);
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_ACTIVE, EREMOTEIO));
|
|
|
|
}
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
int error = spa_activity_check(spa, ub, spa->spa_config);
|
2017-07-15 04:15:00 +03:00
|
|
|
if (error) {
|
|
|
|
nvlist_free(label);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
fnvlist_add_uint64(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_MMP_STATE, MMP_STATE_INACTIVE);
|
|
|
|
fnvlist_add_uint64(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_MMP_TXG, ub->ub_txg);
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
fnvlist_add_uint16(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_MMP_SEQ,
|
|
|
|
(MMP_SEQ_VALID(ub) ? MMP_SEQ(ub) : 0));
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2012-12-14 03:24:15 +04:00
|
|
|
* If the pool has an unsupported version we can't open it.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2012-12-14 03:24:15 +04:00
|
|
|
if (!SPA_VERSION_IS_SUPPORTED(ub->ub_version)) {
|
|
|
|
nvlist_free(label);
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_failed(spa, "version %llu is not supported",
|
|
|
|
(u_longlong_t)ub->ub_version);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_VERSION_NEWER, ENOTSUP));
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (ub->ub_version >= SPA_VERSION_FEATURES) {
|
|
|
|
nvlist_t *features;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we weren't able to find what's necessary for reading the
|
|
|
|
* MOS in the label, return failure.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
if (label == NULL) {
|
|
|
|
spa_load_failed(spa, "label config unavailable");
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
|
|
|
|
ENXIO));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_FEATURES_FOR_READ,
|
|
|
|
&features) != 0) {
|
2012-12-14 03:24:15 +04:00
|
|
|
nvlist_free(label);
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_failed(spa, "invalid label: '%s' missing",
|
|
|
|
ZPOOL_CONFIG_FEATURES_FOR_READ);
|
2012-12-14 03:24:15 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
|
|
|
|
ENXIO));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Update our in-core representation with the definitive values
|
|
|
|
* from the label.
|
|
|
|
*/
|
|
|
|
nvlist_free(spa->spa_label_features);
|
2021-10-27 02:15:38 +03:00
|
|
|
spa->spa_label_features = fnvlist_dup(features);
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
nvlist_free(label);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Look through entries in the label nvlist's features_for_read. If
|
|
|
|
* there is a feature listed there which we don't understand then we
|
|
|
|
* cannot open a pool.
|
|
|
|
*/
|
|
|
|
if (ub->ub_version >= SPA_VERSION_FEATURES) {
|
|
|
|
nvlist_t *unsup_feat;
|
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
unsup_feat = fnvlist_alloc();
|
2012-12-14 03:24:15 +04:00
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (nvpair_t *nvp = nvlist_next_nvpair(spa->spa_label_features,
|
|
|
|
NULL); nvp != NULL;
|
2012-12-14 03:24:15 +04:00
|
|
|
nvp = nvlist_next_nvpair(spa->spa_label_features, nvp)) {
|
|
|
|
if (!zfeature_is_supported(nvpair_name(nvp))) {
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_string(unsup_feat,
|
|
|
|
nvpair_name(nvp), "");
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!nvlist_empty(unsup_feat)) {
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_nvlist(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat);
|
2012-12-14 03:24:15 +04:00
|
|
|
nvlist_free(unsup_feat);
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_failed(spa, "some features are unsupported");
|
2012-12-14 03:24:15 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
|
|
|
|
ENOTSUP));
|
|
|
|
}
|
|
|
|
|
|
|
|
nvlist_free(unsup_feat);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (type != SPA_IMPORT_ASSEMBLE && spa->spa_config_splitting) {
|
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_try_repair(spa, spa->spa_config);
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
nvlist_free(spa->spa_config_splitting);
|
|
|
|
spa->spa_config_splitting = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize internal SPA structures.
|
|
|
|
*/
|
2016-12-17 01:11:29 +03:00
|
|
|
spa_ld_select_uberblock_done(spa, ub);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_ld_open_rootbp(spa_t *spa)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
error = dsl_pool_init(spa, spa->spa_first_txg, &spa->spa_dsl_pool);
|
2016-03-10 18:16:02 +03:00
|
|
|
if (error != 0) {
|
|
|
|
spa_load_failed(spa, "unable to open rootbp in dsl_pool_init "
|
|
|
|
"[error=%d]", error);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
spa->spa_meta_objset = spa->spa_dsl_pool->dp_meta_objset;
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2016-12-17 01:11:29 +03:00
|
|
|
spa_ld_trusted_config(spa_t *spa, spa_import_type_t type,
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
boolean_t reloading)
|
2016-02-23 19:49:30 +03:00
|
|
|
{
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
vdev_t *mrvd, *rvd = spa->spa_root_vdev;
|
|
|
|
nvlist_t *nv, *mos_config, *policy;
|
|
|
|
int error = 0, copy_error;
|
|
|
|
uint64_t healthy_tvds, healthy_tvds_mos;
|
|
|
|
uint64_t mos_config_txg;
|
2016-02-23 19:49:30 +03:00
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object, B_TRUE)
|
|
|
|
!= 0)
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
* If we're assembling a pool from a split, the config provided is
|
|
|
|
* already trusted so there is nothing to do.
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
*/
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (type == SPA_IMPORT_ASSEMBLE)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
healthy_tvds = spa_healthy_core_tvds(spa);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (load_nvlist(spa, spa->spa_config_object, &mos_config)
|
|
|
|
!= 0) {
|
|
|
|
spa_load_failed(spa, "unable to retrieve MOS config");
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are doing an open, pool owner wasn't verified yet, thus do
|
|
|
|
* the verification here.
|
|
|
|
*/
|
|
|
|
if (spa->spa_load_state == SPA_LOAD_OPEN) {
|
|
|
|
error = spa_verify_host(spa, mos_config);
|
|
|
|
if (error != 0) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
nvlist_free(mos_config);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
return (error);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
nv = fnvlist_lookup_nvlist(mos_config, ZPOOL_CONFIG_VDEV_TREE);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Build a new vdev tree from the trusted config
|
|
|
|
*/
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
error = spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD);
|
|
|
|
if (error != 0) {
|
|
|
|
nvlist_free(mos_config);
|
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
spa_load_failed(spa, "spa_config_parse failed [error=%d]",
|
|
|
|
error);
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
|
|
|
|
}
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Vdev paths in the MOS may be obsolete. If the untrusted config was
|
|
|
|
* obtained by scanning /dev/dsk, then it will have the right vdev
|
|
|
|
* paths. We update the trusted MOS config with this information.
|
|
|
|
* We first try to copy the paths with vdev_copy_path_strict, which
|
|
|
|
* succeeds only when both configs have exactly the same vdev tree.
|
|
|
|
* If that fails, we fall back to a more flexible method that has a
|
|
|
|
* best effort policy.
|
|
|
|
*/
|
|
|
|
copy_error = vdev_copy_path_strict(rvd, mrvd);
|
|
|
|
if (copy_error != 0 || spa_load_print_vdev_tree) {
|
|
|
|
spa_load_note(spa, "provided vdev tree:");
|
|
|
|
vdev_dbgmsg_print_tree(rvd, 2);
|
|
|
|
spa_load_note(spa, "MOS vdev tree:");
|
|
|
|
vdev_dbgmsg_print_tree(mrvd, 2);
|
|
|
|
}
|
|
|
|
if (copy_error != 0) {
|
|
|
|
spa_load_note(spa, "vdev_copy_path_strict failed, falling "
|
|
|
|
"back to vdev_copy_path_relaxed");
|
|
|
|
vdev_copy_path_relaxed(rvd, mrvd);
|
|
|
|
}
|
|
|
|
|
|
|
|
vdev_close(rvd);
|
|
|
|
vdev_free(rvd);
|
|
|
|
spa->spa_root_vdev = mrvd;
|
|
|
|
rvd = mrvd;
|
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We will use spa_config if we decide to reload the spa or if spa_load
|
|
|
|
* fails and we rewind. We must thus regenerate the config using the
|
2017-02-11 01:51:09 +03:00
|
|
|
* MOS information with the updated paths. ZPOOL_LOAD_POLICY is used to
|
|
|
|
* pass settings on how to load the pool and is not stored in the MOS.
|
|
|
|
* We copy it over to our new, trusted config.
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
*/
|
|
|
|
mos_config_txg = fnvlist_lookup_uint64(mos_config,
|
|
|
|
ZPOOL_CONFIG_POOL_TXG);
|
|
|
|
nvlist_free(mos_config);
|
|
|
|
mos_config = spa_config_generate(spa, NULL, mos_config_txg, B_FALSE);
|
2017-02-11 01:51:09 +03:00
|
|
|
if (nvlist_lookup_nvlist(spa->spa_config, ZPOOL_LOAD_POLICY,
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
&policy) == 0)
|
2017-02-11 01:51:09 +03:00
|
|
|
fnvlist_add_nvlist(mos_config, ZPOOL_LOAD_POLICY, policy);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_config_set(spa, mos_config);
|
|
|
|
spa->spa_config_source = SPA_CONFIG_SRC_MOS;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now that we got the config from the MOS, we should be more strict
|
|
|
|
* in checking blkptrs and can make assumptions about the consistency
|
|
|
|
* of the vdev tree. spa_trust_config must be set to true before opening
|
|
|
|
* vdevs in order for them to be writeable.
|
|
|
|
*/
|
|
|
|
spa->spa_trust_config = B_TRUE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Open and validate the new vdev tree
|
|
|
|
*/
|
|
|
|
error = spa_ld_open_vdevs(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
error = spa_ld_validate_vdevs(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
if (copy_error != 0 || spa_load_print_vdev_tree) {
|
|
|
|
spa_load_note(spa, "final vdev tree:");
|
|
|
|
vdev_dbgmsg_print_tree(rvd, 2);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa->spa_load_state != SPA_LOAD_TRYIMPORT &&
|
|
|
|
!spa->spa_extreme_rewind && zfs_max_missing_tvds == 0) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
* Sanity check to make sure that we are indeed loading the
|
|
|
|
* latest uberblock. If we missed SPA_SYNC_MIN_VDEVS tvds
|
|
|
|
* in the config provided and they happened to be the only ones
|
|
|
|
* to have the latest uberblock, we could involuntarily perform
|
|
|
|
* an extreme rewind.
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
*/
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
healthy_tvds_mos = spa_healthy_core_tvds(spa);
|
|
|
|
if (healthy_tvds_mos - healthy_tvds >=
|
|
|
|
SPA_SYNC_MIN_VDEVS) {
|
|
|
|
spa_load_note(spa, "config provided misses too many "
|
|
|
|
"top-level vdevs compared to MOS (%lld vs %lld). ",
|
|
|
|
(u_longlong_t)healthy_tvds,
|
|
|
|
(u_longlong_t)healthy_tvds_mos);
|
|
|
|
spa_load_note(spa, "vdev tree:");
|
|
|
|
vdev_dbgmsg_print_tree(rvd, 2);
|
|
|
|
if (reloading) {
|
|
|
|
spa_load_failed(spa, "config was already "
|
|
|
|
"provided from MOS. Aborting.");
|
|
|
|
return (spa_vdev_err(rvd,
|
|
|
|
VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
}
|
|
|
|
spa_load_note(spa, "spa must be reloaded using MOS "
|
|
|
|
"config");
|
|
|
|
return (SET_ERROR(EAGAIN));
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
error = spa_check_for_missing_logs(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO));
|
|
|
|
|
|
|
|
if (rvd->vdev_guid_sum != spa->spa_uberblock.ub_guid_sum) {
|
|
|
|
spa_load_failed(spa, "uberblock guid sum doesn't match MOS "
|
|
|
|
"guid sum (%llu != %llu)",
|
|
|
|
(u_longlong_t)spa->spa_uberblock.ub_guid_sum,
|
|
|
|
(u_longlong_t)rvd->vdev_guid_sum);
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM,
|
|
|
|
ENXIO));
|
|
|
|
}
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_ld_open_indirect_vdev_metadata(spa_t *spa)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
|
|
|
* Everything that we read before spa_remove_init() must be stored
|
|
|
|
* on concreted vdevs. Therefore we do this as early as possible.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
error = spa_remove_init(spa);
|
|
|
|
if (error != 0) {
|
|
|
|
spa_load_failed(spa, "spa_remove_init failed [error=%d]",
|
|
|
|
error);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
/*
|
|
|
|
* Retrieve information needed to condense indirect vdev mappings.
|
|
|
|
*/
|
|
|
|
error = spa_condense_init(spa);
|
|
|
|
if (error != 0) {
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_failed(spa, "spa_condense_init failed [error=%d]",
|
|
|
|
error);
|
2016-02-23 19:49:30 +03:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_ld_check_features(spa_t *spa, boolean_t *missing_feat_writep)
|
2016-02-23 19:49:30 +03:00
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (spa_version(spa) >= SPA_VERSION_FEATURES) {
|
|
|
|
boolean_t missing_feat_read = B_FALSE;
|
2012-12-15 03:00:45 +04:00
|
|
|
nvlist_t *unsup_feat, *enabled_feat;
|
2012-12-14 03:24:15 +04:00
|
|
|
|
|
|
|
if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_READ,
|
2016-03-10 18:16:02 +03:00
|
|
|
&spa->spa_feat_for_read_obj, B_TRUE) != 0) {
|
2012-12-14 03:24:15 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_WRITE,
|
2016-03-10 18:16:02 +03:00
|
|
|
&spa->spa_feat_for_write_obj, B_TRUE) != 0) {
|
2012-12-14 03:24:15 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa_dir_prop(spa, DMU_POOL_FEATURE_DESCRIPTIONS,
|
2016-03-10 18:16:02 +03:00
|
|
|
&spa->spa_feat_desc_obj, B_TRUE) != 0) {
|
2012-12-14 03:24:15 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
}
|
|
|
|
|
2012-12-15 03:00:45 +04:00
|
|
|
enabled_feat = fnvlist_alloc();
|
|
|
|
unsup_feat = fnvlist_alloc();
|
2012-12-14 03:24:15 +04:00
|
|
|
|
2013-10-08 21:13:05 +04:00
|
|
|
if (!spa_features_check(spa, B_FALSE,
|
2012-12-15 03:00:45 +04:00
|
|
|
unsup_feat, enabled_feat))
|
2012-12-14 03:24:15 +04:00
|
|
|
missing_feat_read = B_TRUE;
|
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
if (spa_writeable(spa) ||
|
|
|
|
spa->spa_load_state == SPA_LOAD_TRYIMPORT) {
|
2013-10-08 21:13:05 +04:00
|
|
|
if (!spa_features_check(spa, B_TRUE,
|
2012-12-15 03:00:45 +04:00
|
|
|
unsup_feat, enabled_feat)) {
|
2016-02-23 19:49:30 +03:00
|
|
|
*missing_feat_writep = B_TRUE;
|
2012-12-15 03:00:45 +04:00
|
|
|
}
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
|
2012-12-15 03:00:45 +04:00
|
|
|
fnvlist_add_nvlist(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_ENABLED_FEAT, enabled_feat);
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (!nvlist_empty(unsup_feat)) {
|
2012-12-15 03:00:45 +04:00
|
|
|
fnvlist_add_nvlist(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat);
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
|
2012-12-15 03:00:45 +04:00
|
|
|
fnvlist_free(enabled_feat);
|
|
|
|
fnvlist_free(unsup_feat);
|
2012-12-14 03:24:15 +04:00
|
|
|
|
|
|
|
if (!missing_feat_read) {
|
|
|
|
fnvlist_add_boolean(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_CAN_RDONLY);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the state is SPA_LOAD_TRYIMPORT, our objective is
|
|
|
|
* twofold: to determine whether the pool is available for
|
|
|
|
* import in read-write mode and (if it is not) whether the
|
|
|
|
* pool is available for import in read-only mode. If the pool
|
|
|
|
* is available for import in read-write mode, it is displayed
|
|
|
|
* as available in userland; if it is not available for import
|
|
|
|
* in read-only mode, it is displayed as unavailable in
|
|
|
|
* userland. If the pool is available for import in read-only
|
|
|
|
* mode but not read-write mode, it is displayed as unavailable
|
|
|
|
* in userland with a special note that the pool is actually
|
|
|
|
* available for open in read-only mode.
|
|
|
|
*
|
|
|
|
* As a result, if the state is SPA_LOAD_TRYIMPORT and we are
|
|
|
|
* missing a feature for write, we must first determine whether
|
|
|
|
* the pool can be opened read-only before returning to
|
|
|
|
* userland in order to know whether to display the
|
|
|
|
* abovementioned note.
|
|
|
|
*/
|
2016-02-23 19:49:30 +03:00
|
|
|
if (missing_feat_read || (*missing_feat_writep &&
|
2012-12-14 03:24:15 +04:00
|
|
|
spa_writeable(spa))) {
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_failed(spa, "pool uses unsupported features");
|
2012-12-14 03:24:15 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
|
|
|
|
ENOTSUP));
|
|
|
|
}
|
2013-12-09 22:37:51 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Load refcounts for ZFS features from disk into an in-memory
|
|
|
|
* cache during SPA initialization.
|
|
|
|
*/
|
2017-11-04 23:25:13 +03:00
|
|
|
for (spa_feature_t i = 0; i < SPA_FEATURES; i++) {
|
2013-12-09 22:37:51 +04:00
|
|
|
uint64_t refcount;
|
|
|
|
|
|
|
|
error = feature_get_refcount_from_disk(spa,
|
|
|
|
&spa_feature_table[i], &refcount);
|
|
|
|
if (error == 0) {
|
|
|
|
spa->spa_feat_refcount_cache[i] = refcount;
|
|
|
|
} else if (error == ENOTSUP) {
|
|
|
|
spa->spa_feat_refcount_cache[i] =
|
|
|
|
SPA_FEATURE_DISABLED;
|
|
|
|
} else {
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_failed(spa, "error getting refcount "
|
|
|
|
"for feature %s [error=%d]",
|
|
|
|
spa_feature_table[i].fi_guid, error);
|
2013-12-09 22:37:51 +04:00
|
|
|
return (spa_vdev_err(rvd,
|
|
|
|
VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa_feature_is_active(spa, SPA_FEATURE_ENABLED_TXG)) {
|
|
|
|
if (spa_dir_prop(spa, DMU_POOL_FEATURE_ENABLED_TXG,
|
2016-03-10 18:16:02 +03:00
|
|
|
&spa->spa_feat_enabled_txg_obj, B_TRUE) != 0)
|
2013-12-09 22:37:51 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
|
2019-02-04 22:24:55 +03:00
|
|
|
/*
|
|
|
|
* Encryption was added before bookmark_v2, even though bookmark_v2
|
|
|
|
* is now a dependency. If this pool has encryption enabled without
|
|
|
|
* bookmark_v2, trigger an errata message.
|
|
|
|
*/
|
|
|
|
if (spa_feature_is_enabled(spa, SPA_FEATURE_ENCRYPTION) &&
|
|
|
|
!spa_feature_is_enabled(spa, SPA_FEATURE_BOOKMARK_V2)) {
|
|
|
|
spa->spa_errata = ZPOOL_ERRATA_ZOL_8308_ENCRYPTION;
|
|
|
|
}
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_ld_load_special_directories(spa_t *spa)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
spa->spa_is_initializing = B_TRUE;
|
|
|
|
error = dsl_pool_open(spa->spa_dsl_pool);
|
|
|
|
spa->spa_is_initializing = B_FALSE;
|
2016-03-10 18:16:02 +03:00
|
|
|
if (error != 0) {
|
|
|
|
spa_load_failed(spa, "dsl_pool_open failed [error=%d]", error);
|
2012-12-14 03:24:15 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
2012-12-14 03:24:15 +04:00
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
static int
|
|
|
|
spa_ld_get_props(spa_t *spa)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
uint64_t obj;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-16 01:47:05 +03:00
|
|
|
/* Grab the checksum salt from the MOS. */
|
|
|
|
error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
DMU_POOL_CHECKSUM_SALT, 1,
|
|
|
|
sizeof (spa->spa_cksum_salt.zcs_bytes),
|
|
|
|
spa->spa_cksum_salt.zcs_bytes);
|
|
|
|
if (error == ENOENT) {
|
|
|
|
/* Generate a new salt for subsequent use */
|
|
|
|
(void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
|
|
|
|
sizeof (spa->spa_cksum_salt.zcs_bytes));
|
|
|
|
} else if (error != 0) {
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_failed(spa, "unable to retrieve checksum salt from "
|
|
|
|
"MOS [error=%d]", error);
|
2016-06-16 01:47:05 +03:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
}
|
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj, B_TRUE) != 0)
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
error = bpobj_open(&spa->spa_deferred_bpobj, spa->spa_meta_objset, obj);
|
2016-03-10 18:16:02 +03:00
|
|
|
if (error != 0) {
|
|
|
|
spa_load_failed(spa, "error opening deferred-frees bpobj "
|
|
|
|
"[error=%d]", error);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Load the bit that tells us to use the new accounting function
|
|
|
|
* (raid-z deflation). If we have an older pool, this will not
|
|
|
|
* be present.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != 0 && error != ENOENT)
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
|
|
|
|
error = spa_dir_prop(spa, DMU_POOL_CREATION_VERSION,
|
2016-03-10 18:16:02 +03:00
|
|
|
&spa->spa_creation_version, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != 0 && error != ENOENT)
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Load the persistent error log. If we have an older pool, this will
|
|
|
|
* not be present.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last,
|
|
|
|
B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != 0 && error != ENOENT)
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_dir_prop(spa, DMU_POOL_ERRLOG_SCRUB,
|
2016-03-10 18:16:02 +03:00
|
|
|
&spa->spa_errlog_scrub, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != 0 && error != ENOENT)
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
/*
|
|
|
|
* Load the livelist deletion field. If a livelist is queued for
|
|
|
|
* deletion, indicate that in the spa
|
|
|
|
*/
|
|
|
|
error = spa_dir_prop(spa, DMU_POOL_DELETED_CLONES,
|
|
|
|
&spa->spa_livelists_to_delete, B_FALSE);
|
|
|
|
if (error != 0 && error != ENOENT)
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Load the history object. If we have an older pool, this
|
|
|
|
* will not be present.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != 0 && error != ENOENT)
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
|
2016-04-11 23:16:57 +03:00
|
|
|
/*
|
|
|
|
* Load the per-vdev ZAP map. If we have an older pool, this will not
|
|
|
|
* be present; in this case, defer its creation to a later time to
|
|
|
|
* avoid dirtying the MOS this early / out of sync context. See
|
|
|
|
* spa_sync_config_object.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* The sentinel is only available in the MOS config. */
|
2017-11-04 23:25:13 +03:00
|
|
|
nvlist_t *mos_config;
|
2016-03-10 18:16:02 +03:00
|
|
|
if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0) {
|
|
|
|
spa_load_failed(spa, "unable to retrieve MOS config");
|
2016-04-11 23:16:57 +03:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
2016-04-11 23:16:57 +03:00
|
|
|
|
|
|
|
error = spa_dir_prop(spa, DMU_POOL_VDEV_ZAP_MAP,
|
2016-03-10 18:16:02 +03:00
|
|
|
&spa->spa_all_vdev_zaps, B_FALSE);
|
2016-04-11 23:16:57 +03:00
|
|
|
|
2017-01-14 00:50:22 +03:00
|
|
|
if (error == ENOENT) {
|
|
|
|
VERIFY(!nvlist_exists(mos_config,
|
|
|
|
ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
|
|
|
|
spa->spa_avz_action = AVZ_ACTION_INITIALIZE;
|
|
|
|
ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
|
|
|
|
} else if (error != 0) {
|
2022-07-07 21:43:58 +03:00
|
|
|
nvlist_free(mos_config);
|
2016-04-11 23:16:57 +03:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2017-01-14 00:50:22 +03:00
|
|
|
} else if (!nvlist_exists(mos_config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS)) {
|
2016-04-11 23:16:57 +03:00
|
|
|
/*
|
|
|
|
* An older version of ZFS overwrote the sentinel value, so
|
|
|
|
* we have orphaned per-vdev ZAPs in the MOS. Defer their
|
|
|
|
* destruction to later; see spa_sync_config_object.
|
|
|
|
*/
|
|
|
|
spa->spa_avz_action = AVZ_ACTION_DESTROY;
|
|
|
|
/*
|
|
|
|
* We're assuming that no vdevs have had their ZAPs created
|
|
|
|
* before this. Better be sure of it.
|
|
|
|
*/
|
|
|
|
ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
|
|
|
|
}
|
|
|
|
nvlist_free(mos_config);
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
|
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object,
|
|
|
|
B_FALSE);
|
2016-02-23 19:49:30 +03:00
|
|
|
if (error && error != ENOENT)
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
|
|
|
|
if (error == 0) {
|
2021-07-16 16:19:59 +03:00
|
|
|
uint64_t autoreplace = 0;
|
2016-02-23 19:49:30 +03:00
|
|
|
|
|
|
|
spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs);
|
|
|
|
spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace);
|
|
|
|
spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation);
|
|
|
|
spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode);
|
|
|
|
spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand);
|
2018-05-08 07:08:33 +03:00
|
|
|
spa_prop_find(spa, ZPOOL_PROP_MULTIHOST, &spa->spa_multihost);
|
2019-03-29 19:13:20 +03:00
|
|
|
spa_prop_find(spa, ZPOOL_PROP_AUTOTRIM, &spa->spa_autotrim);
|
2016-02-23 19:49:30 +03:00
|
|
|
spa->spa_autoreplace = (autoreplace != 0);
|
|
|
|
}
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
/*
|
|
|
|
* If we are importing a pool with missing top-level vdevs,
|
|
|
|
* we enforce that the pool doesn't panic or get suspended on
|
|
|
|
* error since the likelihood of missing data is extremely high.
|
|
|
|
*/
|
|
|
|
if (spa->spa_missing_tvds > 0 &&
|
|
|
|
spa->spa_failmode != ZIO_FAILURE_MODE_CONTINUE &&
|
|
|
|
spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
|
|
|
|
spa_load_note(spa, "forcing failmode to 'continue' "
|
|
|
|
"as some top level vdevs are missing");
|
|
|
|
spa->spa_failmode = ZIO_FAILURE_MODE_CONTINUE;
|
|
|
|
}
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_ld_open_aux_vdevs(spa_t *spa, spa_import_type_t type)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* If we're assembling the pool from the split-off vdevs of
|
|
|
|
* an existing pool, we don't want to attach the spares & cache
|
|
|
|
* devices.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Load any hot spares for this pool.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object,
|
|
|
|
B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != 0 && error != ENOENT)
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(spa_version(spa) >= SPA_VERSION_SPARES);
|
|
|
|
if (load_nvlist(spa, spa->spa_spares.sav_object,
|
2016-03-10 18:16:02 +03:00
|
|
|
&spa->spa_spares.sav_config) != 0) {
|
|
|
|
spa_load_failed(spa, "error loading spares nvlist");
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_load_spares(spa);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else if (error == 0) {
|
|
|
|
spa->spa_spares.sav_sync = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Load any level 2 ARC devices for this pool.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_dir_prop(spa, DMU_POOL_L2CACHE,
|
2016-03-10 18:16:02 +03:00
|
|
|
&spa->spa_l2cache.sav_object, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error != 0 && error != ENOENT)
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
|
|
|
if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(spa_version(spa) >= SPA_VERSION_L2CACHE);
|
|
|
|
if (load_nvlist(spa, spa->spa_l2cache.sav_object,
|
2016-03-10 18:16:02 +03:00
|
|
|
&spa->spa_l2cache.sav_config) != 0) {
|
|
|
|
spa_load_failed(spa, "error loading l2cache nvlist");
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_load_l2cache(spa);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else if (error == 0) {
|
|
|
|
spa->spa_l2cache.sav_sync = B_TRUE;
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
static int
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_ld_load_vdev_metadata(spa_t *spa)
|
2016-02-23 19:49:30 +03:00
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
/*
|
|
|
|
* If the 'multihost' property is set, then never allow a pool to
|
|
|
|
* be imported when the system hostid is zero. The exception to
|
|
|
|
* this rule is zdb which is always allowed to access pools.
|
|
|
|
*/
|
2019-09-10 23:42:30 +03:00
|
|
|
if (spa_multihost(spa) && spa_get_hostid(spa) == 0 &&
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
(spa->spa_import_flags & ZFS_IMPORT_SKIP_MMP) == 0) {
|
|
|
|
fnvlist_add_uint64(spa->spa_load_info,
|
|
|
|
ZPOOL_CONFIG_MMP_STATE, MMP_STATE_NO_HOSTID);
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_ACTIVE, EREMOTEIO));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* If the 'autoreplace' property is set, then post a resource notifying
|
|
|
|
* the ZFS DE that it should not issue any faults for unopenable
|
|
|
|
* devices. We also iterate over the vdevs, and post a sysevent for any
|
|
|
|
* unopenable vdevs so that the normal autoreplace handler can take
|
|
|
|
* over.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
if (spa->spa_autoreplace && spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_check_removed(spa->spa_root_vdev);
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* For the import case, this is done in spa_import(), because
|
|
|
|
* at this point we're using the spare definitions from
|
|
|
|
* the MOS config, not necessarily from the userland config.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
if (spa->spa_load_state != SPA_LOAD_IMPORT) {
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_aux_check_removed(&spa->spa_spares);
|
|
|
|
spa_aux_check_removed(&spa->spa_l2cache);
|
|
|
|
}
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2016-02-23 19:49:30 +03:00
|
|
|
* Load the vdev metadata such as metaslabs, DTLs, spacemap object, etc.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
error = vdev_load(rvd);
|
|
|
|
if (error != 0) {
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_failed(spa, "vdev_load failed [error=%d]", error);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
error = spa_ld_log_spacemaps(spa);
|
|
|
|
if (error != 0) {
|
Improve log spacemap load time
Previous flushing algorithm limited only total number of log blocks to
the minimum of 256K and 4x number of metaslabs in the pool. As result,
system with 1500 disks with 1000 metaslabs each, touching several new
metaslabs each TXG could grow spacemap log to huge size without much
benefits. We've observed one of such systems importing pool for about
45 minutes.
This patch improves the situation from five sides:
- By limiting maximum period for each metaslab to be flushed to 1000
TXGs, that effectively limits maximum number of per-TXG spacemap logs
to load to the same number.
- By making flushing more smooth via accounting number of metaslabs
that were touched after the last flush and actually need another flush,
not just ms_unflushed_txg bump.
- By applying zfs_unflushed_log_block_pct to the number of metaslabs
that were touched after the last flush, not all metaslabs in the pool.
- By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
advance, making log spacemap load process for wide HDD pool CPU-bound,
accelerating it by many times.
- By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing
single-threaded by nature log processing time from ~10 to ~5 minutes.
As further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change, requiring new pool feature.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12789
2022-04-26 20:44:21 +03:00
|
|
|
spa_load_failed(spa, "spa_ld_log_spacemaps failed [error=%d]",
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
error);
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2016-02-23 19:49:30 +03:00
|
|
|
* Propagate the leaf DTLs we just loaded all the way up the vdev tree.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2020-07-03 21:05:50 +03:00
|
|
|
vdev_dtl_reassess(rvd, 0, 0, B_FALSE, B_FALSE);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_ld_load_dedup_tables(spa_t *spa)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
error = ddt_load(spa);
|
2016-03-10 18:16:02 +03:00
|
|
|
if (error != 0) {
|
|
|
|
spa_load_failed(spa, "ddt_load failed [error=%d]", error);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2022-04-19 21:38:30 +03:00
|
|
|
spa_ld_verify_logs(spa_t *spa, spa_import_type_t type, const char **ereport)
|
2016-02-23 19:49:30 +03:00
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
if (type != SPA_IMPORT_ASSEMBLE && spa_writeable(spa)) {
|
|
|
|
boolean_t missing = spa_check_logs(spa);
|
|
|
|
if (missing) {
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (spa->spa_missing_tvds != 0) {
|
|
|
|
spa_load_note(spa, "spa_check_logs failed "
|
|
|
|
"so dropping the logs");
|
|
|
|
} else {
|
|
|
|
*ereport = FM_EREPORT_ZFS_LOG_REPLAY;
|
|
|
|
spa_load_failed(spa, "spa_check_logs failed");
|
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG,
|
|
|
|
ENXIO));
|
|
|
|
}
|
2016-03-10 18:16:02 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_ld_verify_pool_data(spa_t *spa)
|
2016-02-23 19:49:30 +03:00
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We've successfully opened the pool, verify that we're ready
|
|
|
|
* to start pushing transactions.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
if (spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
|
2016-02-23 19:49:30 +03:00
|
|
|
error = spa_load_verify(spa);
|
|
|
|
if (error != 0) {
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_failed(spa, "spa_load_verify failed "
|
|
|
|
"[error=%d]", error);
|
2016-02-23 19:49:30 +03:00
|
|
|
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
|
|
|
|
error));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_ld_claim_log_blocks(spa_t *spa)
|
|
|
|
{
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
dsl_pool_t *dp = spa_get_dsl(spa);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Claim log blocks that haven't been committed yet.
|
|
|
|
* This must all happen in a single txg.
|
|
|
|
* Note: spa_claim_max_txg is updated by spa_claim_notify(),
|
|
|
|
* invoked from zil_claim_log_block()'s i/o done callback.
|
|
|
|
* Price of rollback is that we abandon the log.
|
|
|
|
*/
|
|
|
|
spa->spa_claiming = B_TRUE;
|
|
|
|
|
|
|
|
tx = dmu_tx_create_assigned(dp, spa_first_txg(spa));
|
|
|
|
(void) dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
|
|
|
|
zil_claim, tx, DS_FIND_CHILDREN);
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
spa->spa_claiming = B_FALSE;
|
|
|
|
|
|
|
|
spa_set_log_state(spa, SPA_LOG_GOOD);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_ld_check_for_config_update(spa_t *spa, uint64_t config_cache_txg,
|
2016-12-17 01:11:29 +03:00
|
|
|
boolean_t update_config_cache)
|
2016-02-23 19:49:30 +03:00
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
int need_update = B_FALSE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the config cache is stale, or we have uninitialized
|
|
|
|
* metaslabs (see spa_vdev_add()), then update the config.
|
|
|
|
*
|
|
|
|
* If this is a verbatim import, trust the current
|
|
|
|
* in-core spa_config and update the disk labels.
|
|
|
|
*/
|
2016-12-17 01:11:29 +03:00
|
|
|
if (update_config_cache || config_cache_txg != spa->spa_config_txg ||
|
2016-03-10 18:16:02 +03:00
|
|
|
spa->spa_load_state == SPA_LOAD_IMPORT ||
|
|
|
|
spa->spa_load_state == SPA_LOAD_RECOVER ||
|
2016-02-23 19:49:30 +03:00
|
|
|
(spa->spa_import_flags & ZFS_IMPORT_VERBATIM))
|
|
|
|
need_update = B_TRUE;
|
|
|
|
|
|
|
|
for (int c = 0; c < rvd->vdev_children; c++)
|
|
|
|
if (rvd->vdev_child[c]->vdev_ms_array == 0)
|
|
|
|
need_update = B_TRUE;
|
|
|
|
|
|
|
|
/*
|
2019-09-03 03:56:41 +03:00
|
|
|
* Update the config cache asynchronously in case we're the
|
2016-02-23 19:49:30 +03:00
|
|
|
* root pool, in which case the config cache isn't writable yet.
|
|
|
|
*/
|
|
|
|
if (need_update)
|
|
|
|
spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
|
|
|
|
}
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
static void
|
|
|
|
spa_ld_prepare_for_reload(spa_t *spa)
|
|
|
|
{
|
2019-11-21 20:32:57 +03:00
|
|
|
spa_mode_t mode = spa->spa_mode;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
int async_suspended = spa->spa_async_suspended;
|
|
|
|
|
|
|
|
spa_unload(spa);
|
|
|
|
spa_deactivate(spa);
|
|
|
|
spa_activate(spa, mode);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We save the value of spa_async_suspended as it gets reset to 0 by
|
|
|
|
* spa_unload(). We want to restore it back to the original value before
|
|
|
|
* returning as we might be calling spa_async_resume() later.
|
|
|
|
*/
|
|
|
|
spa->spa_async_suspended = async_suspended;
|
|
|
|
}
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
static int
|
2016-12-17 01:11:29 +03:00
|
|
|
spa_ld_read_checkpoint_txg(spa_t *spa)
|
|
|
|
{
|
|
|
|
uberblock_t checkpoint;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
ASSERT0(spa->spa_checkpoint_txg);
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
|
|
|
|
error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
DMU_POOL_ZPOOL_CHECKPOINT, sizeof (uint64_t),
|
|
|
|
sizeof (uberblock_t) / sizeof (uint64_t), &checkpoint);
|
|
|
|
|
|
|
|
if (error == ENOENT)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
ASSERT3U(checkpoint.ub_txg, !=, 0);
|
|
|
|
ASSERT3U(checkpoint.ub_checkpoint_txg, !=, 0);
|
|
|
|
ASSERT3U(checkpoint.ub_timestamp, !=, 0);
|
|
|
|
spa->spa_checkpoint_txg = checkpoint.ub_txg;
|
|
|
|
spa->spa_checkpoint_info.sci_timestamp = checkpoint.ub_timestamp;
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_ld_mos_init(spa_t *spa, spa_import_type_t type)
|
2016-02-23 19:49:30 +03:00
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
ASSERT(spa->spa_config_source != SPA_CONFIG_SRC_NONE);
|
2016-03-10 18:16:02 +03:00
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
/*
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
* Never trust the config that is provided unless we are assembling
|
|
|
|
* a pool following a split.
|
|
|
|
* This means don't trust blkptrs and the vdev tree in general. This
|
|
|
|
* also effectively puts the spa in read-only mode since
|
|
|
|
* spa_writeable() checks for spa_trust_config to be true.
|
|
|
|
* We will later load a trusted config from the MOS.
|
2016-02-23 19:49:30 +03:00
|
|
|
*/
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (type != SPA_IMPORT_ASSEMBLE)
|
|
|
|
spa->spa_trust_config = B_FALSE;
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
/*
|
|
|
|
* Parse the config provided to create a vdev tree.
|
|
|
|
*/
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
error = spa_ld_parse_config(spa, type);
|
2016-02-23 19:49:30 +03:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
spa_import_progress_add(spa);
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
/*
|
|
|
|
* Now that we have the vdev tree, try to open each vdev. This involves
|
|
|
|
* opening the underlying physical device, retrieving its geometry and
|
|
|
|
* probing the vdev with a dummy I/O. The state of each vdev will be set
|
|
|
|
* based on the success of those operations. After this we'll be ready
|
|
|
|
* to read from the vdevs.
|
|
|
|
*/
|
|
|
|
error = spa_ld_open_vdevs(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the label of each vdev and make sure that the GUIDs stored
|
|
|
|
* there match the GUIDs in the config provided.
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
* If we're assembling a new pool that's been split off from an
|
|
|
|
* existing pool, the labels haven't yet been updated so we skip
|
|
|
|
* validation for now.
|
2016-02-23 19:49:30 +03:00
|
|
|
*/
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (type != SPA_IMPORT_ASSEMBLE) {
|
|
|
|
error = spa_ld_validate_vdevs(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
}
|
2016-02-23 19:49:30 +03:00
|
|
|
|
|
|
|
/*
|
2016-12-17 01:11:29 +03:00
|
|
|
* Read all vdev labels to find the best uberblock (i.e. latest,
|
|
|
|
* unless spa_load_max_txg is set) and store it in spa_uberblock. We
|
|
|
|
* get the list of features required to read blkptrs in the MOS from
|
|
|
|
* the vdev label with the best uberblock and verify that our version
|
|
|
|
* of zfs supports them all.
|
2016-02-23 19:49:30 +03:00
|
|
|
*/
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
error = spa_ld_select_uberblock(spa, type);
|
2016-02-23 19:49:30 +03:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pass that uberblock to the dsl_pool layer which will open the root
|
|
|
|
* blkptr. This blkptr points to the latest version of the MOS and will
|
|
|
|
* allow us to read its contents.
|
|
|
|
*/
|
|
|
|
error = spa_ld_open_rootbp(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_ld_checkpoint_rewind(spa_t *spa)
|
|
|
|
{
|
|
|
|
uberblock_t checkpoint;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
ASSERT(spa->spa_import_flags & ZFS_IMPORT_CHECKPOINT);
|
|
|
|
|
|
|
|
error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
DMU_POOL_ZPOOL_CHECKPOINT, sizeof (uint64_t),
|
|
|
|
sizeof (uberblock_t) / sizeof (uint64_t), &checkpoint);
|
|
|
|
|
|
|
|
if (error != 0) {
|
|
|
|
spa_load_failed(spa, "unable to retrieve checkpointed "
|
|
|
|
"uberblock from the MOS config [error=%d]", error);
|
|
|
|
|
|
|
|
if (error == ENOENT)
|
|
|
|
error = ZFS_ERR_NO_CHECKPOINT;
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT3U(checkpoint.ub_txg, <, spa->spa_uberblock.ub_txg);
|
|
|
|
ASSERT3U(checkpoint.ub_txg, ==, checkpoint.ub_checkpoint_txg);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to update the txg and timestamp of the checkpointed
|
|
|
|
* uberblock to be higher than the latest one. This ensures that
|
|
|
|
* the checkpointed uberblock is selected if we were to close and
|
|
|
|
* reopen the pool right after we've written it in the vdev labels.
|
|
|
|
* (also see block comment in vdev_uberblock_compare)
|
|
|
|
*/
|
|
|
|
checkpoint.ub_txg = spa->spa_uberblock.ub_txg + 1;
|
|
|
|
checkpoint.ub_timestamp = gethrestime_sec();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set current uberblock to be the checkpointed uberblock.
|
|
|
|
*/
|
|
|
|
spa->spa_uberblock = checkpoint;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are doing a normal rewind, then the pool is open for
|
|
|
|
* writing and we sync the "updated" checkpointed uberblock to
|
|
|
|
* disk. Once this is done, we've basically rewound the whole
|
|
|
|
* pool and there is no way back.
|
|
|
|
*
|
|
|
|
* There are cases when we don't want to attempt and sync the
|
|
|
|
* checkpointed uberblock to disk because we are opening a
|
|
|
|
* pool as read-only. Specifically, verifying the checkpointed
|
|
|
|
* state with zdb, and importing the checkpointed state to get
|
|
|
|
* a "preview" of its content.
|
|
|
|
*/
|
|
|
|
if (spa_writeable(spa)) {
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
|
|
|
vdev_t *svd[SPA_SYNC_MIN_VDEVS] = { NULL };
|
|
|
|
int svdcount = 0;
|
|
|
|
int children = rvd->vdev_children;
|
2021-06-23 02:35:23 +03:00
|
|
|
int c0 = random_in_range(children);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
for (int c = 0; c < children; c++) {
|
|
|
|
vdev_t *vd = rvd->vdev_child[(c0 + c) % children];
|
|
|
|
|
|
|
|
/* Stop when revisiting the first vdev */
|
|
|
|
if (c > 0 && svd[0] == vd)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (vd->vdev_ms_array == 0 || vd->vdev_islog ||
|
|
|
|
!vdev_is_concrete(vd))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
svd[svdcount++] = vd;
|
|
|
|
if (svdcount == SPA_SYNC_MIN_VDEVS)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
error = vdev_config_sync(svd, svdcount, spa->spa_first_txg);
|
|
|
|
if (error == 0)
|
|
|
|
spa->spa_last_synced_guid = rvd->vdev_guid;
|
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
|
|
|
|
if (error != 0) {
|
|
|
|
spa_load_failed(spa, "failed to write checkpointed "
|
|
|
|
"uberblock to the vdev labels [error=%d]", error);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_ld_mos_with_trusted_config(spa_t *spa, spa_import_type_t type,
|
|
|
|
boolean_t *update_config_cache)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Parse the config for pool, open and validate vdevs,
|
|
|
|
* select an uberblock, and use that uberblock to open
|
|
|
|
* the MOS.
|
|
|
|
*/
|
|
|
|
error = spa_ld_mos_init(spa, type);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
/*
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
* Retrieve the trusted config stored in the MOS and use it to create
|
|
|
|
* a new, exact version of the vdev tree, then reopen all vdevs.
|
2016-02-23 19:49:30 +03:00
|
|
|
*/
|
2016-12-17 01:11:29 +03:00
|
|
|
error = spa_ld_trusted_config(spa, type, B_FALSE);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (error == EAGAIN) {
|
2016-12-17 01:11:29 +03:00
|
|
|
if (update_config_cache != NULL)
|
|
|
|
*update_config_cache = B_TRUE;
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
/*
|
|
|
|
* Redo the loading process with the trusted config if it is
|
|
|
|
* too different from the untrusted config.
|
|
|
|
*/
|
|
|
|
spa_ld_prepare_for_reload(spa);
|
2016-12-17 01:11:29 +03:00
|
|
|
spa_load_note(spa, "RELOADING");
|
|
|
|
error = spa_ld_mos_init(spa, type);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
error = spa_ld_trusted_config(spa, type, B_TRUE);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
} else if (error != 0) {
|
2016-02-23 19:49:30 +03:00
|
|
|
return (error);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
}
|
2016-02-23 19:49:30 +03:00
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Load an existing storage pool, using the config provided. This config
|
|
|
|
* describes which vdevs are part of the pool and is later validated against
|
|
|
|
* partial configs present in each vdev's label and an entire copy of the
|
|
|
|
* config stored in the MOS.
|
|
|
|
*/
|
|
|
|
static int
|
2022-04-19 21:38:30 +03:00
|
|
|
spa_load_impl(spa_t *spa, spa_import_type_t type, const char **ereport)
|
2016-12-17 01:11:29 +03:00
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
boolean_t missing_feat_write = B_FALSE;
|
|
|
|
boolean_t checkpoint_rewind =
|
|
|
|
(spa->spa_import_flags & ZFS_IMPORT_CHECKPOINT);
|
|
|
|
boolean_t update_config_cache = B_FALSE;
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
ASSERT(spa->spa_config_source != SPA_CONFIG_SRC_NONE);
|
|
|
|
|
|
|
|
spa_load_note(spa, "LOADING");
|
|
|
|
|
|
|
|
error = spa_ld_mos_with_trusted_config(spa, type, &update_config_cache);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are rewinding to the checkpoint then we need to repeat
|
|
|
|
* everything we've done so far in this function but this time
|
|
|
|
* selecting the checkpointed uberblock and using that to open
|
|
|
|
* the MOS.
|
|
|
|
*/
|
|
|
|
if (checkpoint_rewind) {
|
|
|
|
/*
|
|
|
|
* If we are rewinding to the checkpoint update config cache
|
|
|
|
* anyway.
|
|
|
|
*/
|
|
|
|
update_config_cache = B_TRUE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Extract the checkpointed uberblock from the current MOS
|
|
|
|
* and use this as the pool's uberblock from now on. If the
|
|
|
|
* pool is imported as writeable we also write the checkpoint
|
|
|
|
* uberblock to the labels, making the rewind permanent.
|
|
|
|
*/
|
|
|
|
error = spa_ld_checkpoint_rewind(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
2019-09-03 03:56:41 +03:00
|
|
|
* Redo the loading process again with the
|
2016-12-17 01:11:29 +03:00
|
|
|
* checkpointed uberblock.
|
|
|
|
*/
|
|
|
|
spa_ld_prepare_for_reload(spa);
|
|
|
|
spa_load_note(spa, "LOADING checkpointed uberblock");
|
|
|
|
error = spa_ld_mos_with_trusted_config(spa, type, NULL);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Retrieve the checkpoint txg if the pool has a checkpoint.
|
|
|
|
*/
|
|
|
|
error = spa_ld_read_checkpoint_txg(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
/*
|
|
|
|
* Retrieve the mapping of indirect vdevs. Those vdevs were removed
|
|
|
|
* from the pool and their contents were re-mapped to other vdevs. Note
|
|
|
|
* that everything that we read before this step must have been
|
|
|
|
* rewritten on concrete vdevs after the last device removal was
|
|
|
|
* initiated. Otherwise we could be reading from indirect vdevs before
|
|
|
|
* we have loaded their mappings.
|
|
|
|
*/
|
|
|
|
error = spa_ld_open_indirect_vdev_metadata(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Retrieve the full list of active features from the MOS and check if
|
|
|
|
* they are all supported.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
error = spa_ld_check_features(spa, &missing_feat_write);
|
2016-02-23 19:49:30 +03:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Load several special directories from the MOS needed by the dsl_pool
|
|
|
|
* layer.
|
|
|
|
*/
|
|
|
|
error = spa_ld_load_special_directories(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Retrieve pool properties from the MOS.
|
|
|
|
*/
|
|
|
|
error = spa_ld_get_props(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Retrieve the list of auxiliary devices - cache devices and spares -
|
|
|
|
* and open them.
|
|
|
|
*/
|
|
|
|
error = spa_ld_open_aux_vdevs(spa, type);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Load the metadata for all vdevs. Also check if unopenable devices
|
|
|
|
* should be autoreplaced.
|
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
error = spa_ld_load_vdev_metadata(spa);
|
2016-02-23 19:49:30 +03:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
error = spa_ld_load_dedup_tables(spa);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify the logs now to make sure we don't have any unexpected errors
|
|
|
|
* when we claim log blocks later.
|
|
|
|
*/
|
|
|
|
error = spa_ld_verify_logs(spa, type, ereport);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (missing_feat_write) {
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
ASSERT(spa->spa_load_state == SPA_LOAD_TRYIMPORT);
|
2012-12-14 03:24:15 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* At this point, we know that we can open the pool in
|
|
|
|
* read-only mode but not read-write mode. We now have enough
|
|
|
|
* information and can return to userland.
|
|
|
|
*/
|
2016-02-23 19:49:30 +03:00
|
|
|
return (spa_vdev_err(spa->spa_root_vdev, VDEV_AUX_UNSUP_FEAT,
|
|
|
|
ENOTSUP));
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/*
|
2016-02-23 19:49:30 +03:00
|
|
|
* Traverse the last txgs to make sure the pool was left off in a safe
|
|
|
|
* state. When performing an extreme rewind, we verify the whole pool,
|
|
|
|
* which can take a very long time.
|
2010-08-27 01:24:34 +04:00
|
|
|
*/
|
2016-03-10 18:16:02 +03:00
|
|
|
error = spa_ld_verify_pool_data(spa);
|
2016-02-23 19:49:30 +03:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
/*
|
|
|
|
* Calculate the deflated space for the pool. This must be done before
|
|
|
|
* we write anything to the pool because we'd need to update the space
|
|
|
|
* accounting using the deflated sizes.
|
|
|
|
*/
|
|
|
|
spa_update_dspace(spa);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We have now retrieved all the information we needed to open the
|
|
|
|
* pool. If we are importing the pool in read-write mode, a few
|
|
|
|
* additional steps must be performed to finish the import.
|
|
|
|
*/
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (spa_writeable(spa) && (spa->spa_load_state == SPA_LOAD_RECOVER ||
|
2010-05-29 00:45:14 +04:00
|
|
|
spa->spa_load_max_txg == UINT64_MAX)) {
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
uint64_t config_cache_txg = spa->spa_config_txg;
|
|
|
|
|
|
|
|
ASSERT(spa->spa_load_state != SPA_LOAD_TRYIMPORT);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
/*
|
|
|
|
* In case of a checkpoint rewind, log the original txg
|
|
|
|
* of the checkpointed uberblock.
|
|
|
|
*/
|
|
|
|
if (checkpoint_rewind) {
|
|
|
|
spa_history_log_internal(spa, "checkpoint rewind",
|
|
|
|
NULL, "rewound state to txg=%llu",
|
|
|
|
(u_longlong_t)spa->spa_uberblock.ub_checkpoint_txg);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2016-02-23 19:49:30 +03:00
|
|
|
* Traverse the ZIL and claim all blocks.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-02-23 19:49:30 +03:00
|
|
|
spa_ld_claim_log_blocks(spa);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-02-23 19:49:30 +03:00
|
|
|
/*
|
|
|
|
* Kick-off the syncing thread.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_sync_on = B_TRUE;
|
|
|
|
txg_sync_start(spa->spa_dsl_pool);
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
mmp_thread_start(spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Wait for all claims to sync. We sync up to the highest
|
|
|
|
* claimed log block birth time so that claimed log blocks
|
|
|
|
* don't appear to be from the future. spa_claim_max_txg
|
2016-02-23 19:49:30 +03:00
|
|
|
* will have been set for us by ZIL traversal operations
|
|
|
|
* performed above.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
txg_wait_synced(spa->spa_dsl_pool, spa->spa_claim_max_txg);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2016-02-23 19:49:30 +03:00
|
|
|
* Check if we need to request an update of the config. On the
|
|
|
|
* next sync, we would update the config stored in vdev labels
|
|
|
|
* and the cachefile (by default /etc/zfs/zpool.cache).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_ld_check_for_config_update(spa, config_cache_txg,
|
2016-12-17 01:11:29 +03:00
|
|
|
update_config_cache);
|
2009-01-16 00:59:39 +03:00
|
|
|
|
|
|
|
/*
|
2020-07-03 21:05:50 +03:00
|
|
|
* Check if a rebuild was in progress and if so resume it.
|
|
|
|
* Then check all DTLs to see if anything needs resilvering.
|
|
|
|
* The resilver will be deferred if a rebuild was started.
|
2009-01-16 00:59:39 +03:00
|
|
|
*/
|
2020-07-03 21:05:50 +03:00
|
|
|
if (vdev_rebuild_active(spa->spa_root_vdev)) {
|
|
|
|
vdev_rebuild_restart(spa);
|
|
|
|
} else if (!dsl_scan_resilvering(spa->spa_dsl_pool) &&
|
|
|
|
vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL)) {
|
2009-01-16 00:59:39 +03:00
|
|
|
spa_async_request(spa, SPA_ASYNC_RESILVER);
|
2020-07-03 21:05:50 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
/*
|
|
|
|
* Log the fact that we booted up (so that we can detect if
|
|
|
|
* we rebooted in the middle of an operation).
|
|
|
|
*/
|
2017-10-23 19:45:59 +03:00
|
|
|
spa_history_log_version(spa, "open", NULL);
|
2013-08-28 15:45:09 +04:00
|
|
|
|
2018-06-17 03:39:14 +03:00
|
|
|
spa_restart_removal(spa);
|
|
|
|
spa_spawn_aux_threads(spa);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Delete any inconsistent datasets.
|
2018-06-17 03:39:14 +03:00
|
|
|
*
|
|
|
|
* Note:
|
|
|
|
* Since we may be issuing deletes for clones here,
|
|
|
|
* we make sure to do so after we've spawned all the
|
|
|
|
* auxiliary threads above (from which the livelist
|
|
|
|
* deletion zthr is part of).
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
|
|
|
(void) dmu_objset_find(spa_name(spa),
|
|
|
|
dsl_destroy_inconsistent, NULL, DS_FIND_CHILDREN);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clean up any stale temporary dataset userrefs.
|
|
|
|
*/
|
|
|
|
dsl_pool_clean_tmp_userrefs(spa->spa_dsl_pool);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
vdev_initialize_restart(spa->spa_root_vdev);
|
2019-03-29 19:13:20 +03:00
|
|
|
vdev_trim_restart(spa->spa_root_vdev);
|
|
|
|
vdev_autotrim_restart(spa);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
spa_import_progress_remove(spa_guid(spa));
|
2020-04-10 20:33:35 +03:00
|
|
|
spa_async_request(spa, SPA_ASYNC_L2CACHE_REBUILD);
|
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_note(spa, "LOADED");
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return (0);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_load_retry(spa_t *spa, spa_load_state_t state)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2019-11-21 20:32:57 +03:00
|
|
|
spa_mode_t mode = spa->spa_mode;
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_unload(spa);
|
|
|
|
spa_deactivate(spa);
|
|
|
|
|
2014-07-15 22:58:41 +04:00
|
|
|
spa->spa_load_max_txg = spa->spa_uberblock.ub_txg - 1;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
spa_activate(spa, mode);
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_async_suspend(spa);
|
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
spa_load_note(spa, "spa_load_retry: rewind, max txg: %llu",
|
|
|
|
(u_longlong_t)spa->spa_load_max_txg);
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
return (spa_load(spa, state, SPA_IMPORT_EXISTING));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
/*
|
|
|
|
* If spa_load() fails this function will try loading prior txg's. If
|
|
|
|
* 'state' is SPA_LOAD_RECOVER and one of these loads succeeds the pool
|
|
|
|
* will be rewound to that txg. If 'state' is not SPA_LOAD_RECOVER this
|
|
|
|
* function will not rewind the pool and will return the same error as
|
|
|
|
* spa_load().
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa_load_best(spa_t *spa, spa_load_state_t state, uint64_t max_request,
|
|
|
|
int rewind_flags)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2012-12-14 03:24:15 +04:00
|
|
|
nvlist_t *loadinfo = NULL;
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *config = NULL;
|
|
|
|
int load_error, rewind_error;
|
|
|
|
uint64_t safe_rewind_txg;
|
|
|
|
uint64_t min_txg;
|
|
|
|
|
|
|
|
if (spa->spa_load_txg && state == SPA_LOAD_RECOVER) {
|
|
|
|
spa->spa_load_max_txg = spa->spa_load_txg;
|
|
|
|
spa_set_log_state(spa, SPA_LOG_CLEAR);
|
|
|
|
} else {
|
|
|
|
spa->spa_load_max_txg = max_request;
|
2014-07-15 22:58:41 +04:00
|
|
|
if (max_request != UINT64_MAX)
|
|
|
|
spa->spa_extreme_rewind = B_TRUE;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (load_error == 0)
|
|
|
|
return (0);
|
2016-12-17 01:11:29 +03:00
|
|
|
if (load_error == ZFS_ERR_NO_CHECKPOINT) {
|
|
|
|
/*
|
|
|
|
* When attempting checkpoint-rewind on a pool with no
|
|
|
|
* checkpoint, we should not attempt to load uberblocks
|
|
|
|
* from previous txgs when spa_load fails.
|
|
|
|
*/
|
|
|
|
ASSERT(spa->spa_import_flags & ZFS_IMPORT_CHECKPOINT);
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
spa_import_progress_remove(spa_guid(spa));
|
2016-12-17 01:11:29 +03:00
|
|
|
return (load_error);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (spa->spa_root_vdev != NULL)
|
|
|
|
config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
|
|
|
|
|
|
|
|
spa->spa_last_ubsync_txg = spa->spa_uberblock.ub_txg;
|
|
|
|
spa->spa_last_ubsync_txg_ts = spa->spa_uberblock.ub_timestamp;
|
|
|
|
|
|
|
|
if (rewind_flags & ZPOOL_NEVER_REWIND) {
|
|
|
|
nvlist_free(config);
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
spa_import_progress_remove(spa_guid(spa));
|
2010-05-29 00:45:14 +04:00
|
|
|
return (load_error);
|
|
|
|
}
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (state == SPA_LOAD_RECOVER) {
|
|
|
|
/* Price of rolling back is discarding txgs, including log */
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_set_log_state(spa, SPA_LOG_CLEAR);
|
2012-12-14 03:24:15 +04:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* If we aren't rolling back save the load info from our first
|
|
|
|
* import attempt so that we can restore it after attempting
|
|
|
|
* to rewind.
|
|
|
|
*/
|
|
|
|
loadinfo = spa->spa_load_info;
|
|
|
|
spa->spa_load_info = fnvlist_alloc();
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa->spa_load_max_txg = spa->spa_last_ubsync_txg;
|
|
|
|
safe_rewind_txg = spa->spa_last_ubsync_txg - TXG_DEFER_SIZE;
|
|
|
|
min_txg = (rewind_flags & ZPOOL_EXTREME_REWIND) ?
|
|
|
|
TXG_INITIAL : safe_rewind_txg;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Continue as long as we're finding errors, we're still within
|
|
|
|
* the acceptable rewind range, and we're still finding uberblocks
|
|
|
|
*/
|
|
|
|
while (rewind_error && spa->spa_uberblock.ub_txg >= min_txg &&
|
|
|
|
spa->spa_uberblock.ub_txg <= spa->spa_load_max_txg) {
|
|
|
|
if (spa->spa_load_max_txg < safe_rewind_txg)
|
|
|
|
spa->spa_extreme_rewind = B_TRUE;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
rewind_error = spa_load_retry(spa, state);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
spa->spa_extreme_rewind = B_FALSE;
|
|
|
|
spa->spa_load_max_txg = UINT64_MAX;
|
|
|
|
|
|
|
|
if (config && (rewind_error || state != SPA_LOAD_RECOVER))
|
|
|
|
spa_config_set(spa, config);
|
2016-08-06 10:08:51 +03:00
|
|
|
else
|
|
|
|
nvlist_free(config);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (state == SPA_LOAD_RECOVER) {
|
|
|
|
ASSERT3P(loadinfo, ==, NULL);
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
spa_import_progress_remove(spa_guid(spa));
|
2012-12-14 03:24:15 +04:00
|
|
|
return (rewind_error);
|
|
|
|
} else {
|
|
|
|
/* Store the rewind info as part of the initial load info */
|
|
|
|
fnvlist_add_nvlist(loadinfo, ZPOOL_CONFIG_REWIND_INFO,
|
|
|
|
spa->spa_load_info);
|
|
|
|
|
|
|
|
/* Restore the initial load info */
|
|
|
|
fnvlist_free(spa->spa_load_info);
|
|
|
|
spa->spa_load_info = loadinfo;
|
|
|
|
|
zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 20:08:05 +03:00
|
|
|
spa_import_progress_remove(spa_guid(spa));
|
2012-12-14 03:24:15 +04:00
|
|
|
return (load_error);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pool Open/Import
|
|
|
|
*
|
|
|
|
* The import case is identical to an open except that the configuration is sent
|
|
|
|
* down from userland, instead of grabbed from the configuration cache. For the
|
|
|
|
* case of an open, the pool configuration will exist in the
|
|
|
|
* POOL_STATE_UNINITIALIZED state.
|
|
|
|
*
|
|
|
|
* The stats information (gen/count/ustats) is used to gather vdev statistics at
|
|
|
|
* the same time open the pool, without having to keep around the spa_t in some
|
|
|
|
* ambiguous state.
|
|
|
|
*/
|
|
|
|
static int
|
2022-04-19 21:38:30 +03:00
|
|
|
spa_open_common(const char *pool, spa_t **spapp, const void *tag,
|
|
|
|
nvlist_t *nvpolicy, nvlist_t **config)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
spa_t *spa;
|
2010-08-27 01:24:34 +04:00
|
|
|
spa_load_state_t state = SPA_LOAD_OPEN;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
|
|
|
int locked = B_FALSE;
|
2013-05-28 15:50:38 +04:00
|
|
|
int firstopen = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
*spapp = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* As disgusting as this is, we need to support recursive calls to this
|
|
|
|
* function because dsl_dir_open() is called during spa_load(), and ends
|
|
|
|
* up calling spa_open() again. The real fix is to figure out how to
|
|
|
|
* avoid dsl_dir_open() calling this in the first place.
|
|
|
|
*/
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
if (MUTEX_NOT_HELD(&spa_namespace_lock)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
locked = B_TRUE;
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((spa = spa_lookup(pool)) == NULL) {
|
|
|
|
if (locked)
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOENT));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (spa->spa_state == POOL_STATE_UNINITIALIZED) {
|
2017-02-11 01:51:09 +03:00
|
|
|
zpool_load_policy_t policy;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-05-28 15:50:38 +04:00
|
|
|
firstopen = B_TRUE;
|
|
|
|
|
2017-02-11 01:51:09 +03:00
|
|
|
zpool_get_load_policy(nvpolicy ? nvpolicy : spa->spa_config,
|
2010-05-29 00:45:14 +04:00
|
|
|
&policy);
|
2017-02-11 01:51:09 +03:00
|
|
|
if (policy.zlp_rewind & ZPOOL_DO_REWIND)
|
2010-05-29 00:45:14 +04:00
|
|
|
state = SPA_LOAD_RECOVER;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
spa_activate(spa, spa_mode_global);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (state != SPA_LOAD_RECOVER)
|
|
|
|
spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-03-10 18:16:02 +03:00
|
|
|
zfs_dbgmsg("spa_open_common: opening %s", pool);
|
2017-02-11 01:51:09 +03:00
|
|
|
error = spa_load_best(spa, state, policy.zlp_txg,
|
|
|
|
policy.zlp_rewind);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (error == EBADF) {
|
|
|
|
/*
|
|
|
|
* If vdev_validate() returns failure (indicated by
|
|
|
|
* EBADF), it indicates that one of the vdevs indicates
|
|
|
|
* that the pool has been exported or destroyed. If
|
|
|
|
* this is the case, the config cache is out of sync and
|
|
|
|
* we should remove the pool from the namespace.
|
|
|
|
*/
|
|
|
|
spa_unload(spa);
|
|
|
|
spa_deactivate(spa);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_write_cachefile(spa, B_TRUE, B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_remove(spa);
|
|
|
|
if (locked)
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOENT));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (error) {
|
|
|
|
/*
|
|
|
|
* We can't open the pool, but we still have useful
|
|
|
|
* information: the state of each vdev after the
|
|
|
|
* attempted vdev_open(). Return this to the user.
|
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
if (config != NULL && spa->spa_config) {
|
2021-10-27 02:15:38 +03:00
|
|
|
*config = fnvlist_dup(spa->spa_config);
|
|
|
|
fnvlist_add_nvlist(*config,
|
2010-08-27 01:24:34 +04:00
|
|
|
ZPOOL_CONFIG_LOAD_INFO,
|
2021-10-27 02:15:38 +03:00
|
|
|
spa->spa_load_info);
|
2010-08-27 01:24:34 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_unload(spa);
|
|
|
|
spa_deactivate(spa);
|
2010-05-29 00:45:14 +04:00
|
|
|
spa->spa_last_open_failed = error;
|
2008-11-20 23:01:55 +03:00
|
|
|
if (locked)
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
*spapp = NULL;
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
spa_open_ref(spa, tag);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (config != NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
*config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/*
|
|
|
|
* If we've recovered the pool, pass back any information we
|
|
|
|
* gathered while doing the load.
|
|
|
|
*/
|
|
|
|
if (state == SPA_LOAD_RECOVER) {
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_nvlist(*config, ZPOOL_CONFIG_LOAD_INFO,
|
|
|
|
spa->spa_load_info);
|
2010-08-27 01:24:34 +04:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (locked) {
|
|
|
|
spa->spa_last_open_failed = 0;
|
|
|
|
spa->spa_last_ubsync_txg = 0;
|
|
|
|
spa->spa_load_txg = 0;
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
2013-05-28 15:50:38 +04:00
|
|
|
if (firstopen)
|
async zvol minor node creation interferes with receive
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE). This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...). These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.
After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream). This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:
1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset. This
causes the 2nd receive to fail.
2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl). This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.
3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.
To address these problems, this change synchronously creates the minor
node. To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.
Implementation notes:
We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys. The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.
Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set. We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.
There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems. E.g. changing the volmode or snapdev
properties. These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations. In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.
The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem. However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes #7863
Closes #9885
2020-02-03 20:33:14 +03:00
|
|
|
zvol_create_minors_recursive(spa_name(spa));
|
2013-05-28 15:50:38 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
*spapp = spa;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
int
|
2022-04-19 21:38:30 +03:00
|
|
|
spa_open_rewind(const char *name, spa_t **spapp, const void *tag,
|
|
|
|
nvlist_t *policy, nvlist_t **config)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
return (spa_open_common(name, spapp, tag, policy, config));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
2022-04-19 21:38:30 +03:00
|
|
|
spa_open(const char *name, spa_t **spapp, const void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_open_common(name, spapp, tag, NULL, NULL));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Lookup the given spa_t, incrementing the inject count in the process,
|
|
|
|
* preventing it from being exported or destroyed.
|
|
|
|
*/
|
|
|
|
spa_t *
|
|
|
|
spa_inject_addref(char *name)
|
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
if ((spa = spa_lookup(name)) == NULL) {
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
spa->spa_inject_ref++;
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
return (spa);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_inject_delref(spa_t *spa)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa->spa_inject_ref--;
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add spares device information to the nvlist.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_add_spares(spa_t *spa, nvlist_t *config)
|
|
|
|
{
|
|
|
|
nvlist_t **spares;
|
|
|
|
uint_t i, nspares;
|
|
|
|
nvlist_t *nvroot;
|
|
|
|
uint64_t guid;
|
|
|
|
vdev_stat_t *vs;
|
|
|
|
uint_t vsc;
|
|
|
|
uint64_t pool;
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (spa->spa_spares.sav_count == 0)
|
|
|
|
return;
|
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
nvroot = fnvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE);
|
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
|
|
|
|
ZPOOL_CONFIG_SPARES, &spares, &nspares));
|
2008-11-20 23:01:55 +03:00
|
|
|
if (nspares != 0) {
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
|
|
|
|
(const nvlist_t * const *)spares, nspares);
|
2021-10-27 02:15:38 +03:00
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
|
|
|
|
&spares, &nspares));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Go through and find any spares which have since been
|
|
|
|
* repurposed as an active spare. If this is the case, update
|
|
|
|
* their status appropriately.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < nspares; i++) {
|
2021-10-27 02:15:38 +03:00
|
|
|
guid = fnvlist_lookup_uint64(spares[i],
|
|
|
|
ZPOOL_CONFIG_GUID);
|
2008-12-03 23:09:06 +03:00
|
|
|
if (spa_spare_exists(guid, &pool, NULL) &&
|
|
|
|
pool != 0ULL) {
|
2021-10-27 02:15:38 +03:00
|
|
|
VERIFY0(nvlist_lookup_uint64_array(spares[i],
|
|
|
|
ZPOOL_CONFIG_VDEV_STATS, (uint64_t **)&vs,
|
|
|
|
&vsc));
|
2008-11-20 23:01:55 +03:00
|
|
|
vs->vs_state = VDEV_STATE_CANT_OPEN;
|
|
|
|
vs->vs_aux = VDEV_AUX_SPARED;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add l2cache device information to the nvlist, including vdev stats.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_add_l2cache(spa_t *spa, nvlist_t *config)
|
|
|
|
{
|
|
|
|
nvlist_t **l2cache;
|
|
|
|
uint_t i, j, nl2cache;
|
|
|
|
nvlist_t *nvroot;
|
|
|
|
uint64_t guid;
|
|
|
|
vdev_t *vd;
|
|
|
|
vdev_stat_t *vs;
|
|
|
|
uint_t vsc;
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (spa->spa_l2cache.sav_count == 0)
|
|
|
|
return;
|
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
nvroot = fnvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE);
|
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(spa->spa_l2cache.sav_config,
|
|
|
|
ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache));
|
2008-11-20 23:01:55 +03:00
|
|
|
if (nl2cache != 0) {
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE,
|
|
|
|
(const nvlist_t * const *)l2cache, nl2cache);
|
2021-10-27 02:15:38 +03:00
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE,
|
|
|
|
&l2cache, &nl2cache));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Update level 2 cache device stats.
|
|
|
|
*/
|
|
|
|
|
|
|
|
for (i = 0; i < nl2cache; i++) {
|
2021-10-27 02:15:38 +03:00
|
|
|
guid = fnvlist_lookup_uint64(l2cache[i],
|
|
|
|
ZPOOL_CONFIG_GUID);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
vd = NULL;
|
|
|
|
for (j = 0; j < spa->spa_l2cache.sav_count; j++) {
|
|
|
|
if (guid ==
|
|
|
|
spa->spa_l2cache.sav_vdevs[j]->vdev_guid) {
|
|
|
|
vd = spa->spa_l2cache.sav_vdevs[j];
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
ASSERT(vd != NULL);
|
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
VERIFY0(nvlist_lookup_uint64_array(l2cache[i],
|
|
|
|
ZPOOL_CONFIG_VDEV_STATS, (uint64_t **)&vs, &vsc));
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_get_stats(vd, vs);
|
2016-02-29 21:05:23 +03:00
|
|
|
vdev_config_generate_stats(vd, l2cache[i]);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
static void
|
2015-02-26 23:24:11 +03:00
|
|
|
spa_feature_stats_from_disk(spa_t *spa, nvlist_t *features)
|
2012-12-14 03:24:15 +04:00
|
|
|
{
|
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t za;
|
|
|
|
|
|
|
|
if (spa->spa_feat_for_read_obj != 0) {
|
|
|
|
for (zap_cursor_init(&zc, spa->spa_meta_objset,
|
|
|
|
spa->spa_feat_for_read_obj);
|
|
|
|
zap_cursor_retrieve(&zc, &za) == 0;
|
|
|
|
zap_cursor_advance(&zc)) {
|
|
|
|
ASSERT(za.za_integer_length == sizeof (uint64_t) &&
|
|
|
|
za.za_num_integers == 1);
|
2015-02-26 23:24:11 +03:00
|
|
|
VERIFY0(nvlist_add_uint64(features, za.za_name,
|
2012-12-14 03:24:15 +04:00
|
|
|
za.za_first_integer));
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa->spa_feat_for_write_obj != 0) {
|
|
|
|
for (zap_cursor_init(&zc, spa->spa_meta_objset,
|
|
|
|
spa->spa_feat_for_write_obj);
|
|
|
|
zap_cursor_retrieve(&zc, &za) == 0;
|
|
|
|
zap_cursor_advance(&zc)) {
|
|
|
|
ASSERT(za.za_integer_length == sizeof (uint64_t) &&
|
|
|
|
za.za_num_integers == 1);
|
2015-02-26 23:24:11 +03:00
|
|
|
VERIFY0(nvlist_add_uint64(features, za.za_name,
|
2012-12-14 03:24:15 +04:00
|
|
|
za.za_first_integer));
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
}
|
2015-02-26 23:24:11 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_feature_stats_from_cache(spa_t *spa, nvlist_t *features)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < SPA_FEATURES; i++) {
|
|
|
|
zfeature_info_t feature = spa_feature_table[i];
|
|
|
|
uint64_t refcount;
|
|
|
|
|
|
|
|
if (feature_get_refcount(spa, &feature, &refcount) != 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
VERIFY0(nvlist_add_uint64(features, feature.fi_guid, refcount));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Store a list of pool features and their reference counts in the
|
|
|
|
* config.
|
|
|
|
*
|
|
|
|
* The first time this is called on a spa, allocate a new nvlist, fetch
|
|
|
|
* the pool features and reference counts from disk, then save the list
|
|
|
|
* in the spa. In subsequent calls on the same spa use the saved nvlist
|
|
|
|
* and refresh its values from the cached reference counts. This
|
|
|
|
* ensures we don't block here on I/O on a suspended pool so 'zpool
|
|
|
|
* clear' can resume the pool.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_add_feature_stats(spa_t *spa, nvlist_t *config)
|
|
|
|
{
|
2015-04-23 22:32:59 +03:00
|
|
|
nvlist_t *features;
|
2015-02-26 23:24:11 +03:00
|
|
|
|
|
|
|
ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
|
|
|
|
|
2015-04-23 22:32:59 +03:00
|
|
|
mutex_enter(&spa->spa_feat_stats_lock);
|
|
|
|
features = spa->spa_feat_stats;
|
|
|
|
|
2015-02-26 23:24:11 +03:00
|
|
|
if (features != NULL) {
|
|
|
|
spa_feature_stats_from_cache(spa, features);
|
|
|
|
} else {
|
|
|
|
VERIFY0(nvlist_alloc(&features, NV_UNIQUE_NAME, KM_SLEEP));
|
|
|
|
spa->spa_feat_stats = features;
|
|
|
|
spa_feature_stats_from_disk(spa, features);
|
|
|
|
}
|
2012-12-14 03:24:15 +04:00
|
|
|
|
2015-02-26 23:24:11 +03:00
|
|
|
VERIFY0(nvlist_add_nvlist(config, ZPOOL_CONFIG_FEATURE_STATS,
|
|
|
|
features));
|
2015-04-23 22:32:59 +03:00
|
|
|
|
|
|
|
mutex_exit(&spa->spa_feat_stats_lock);
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
2012-12-14 03:24:15 +04:00
|
|
|
spa_get_stats(const char *name, nvlist_t **config,
|
|
|
|
char *altroot, size_t buflen)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int error;
|
|
|
|
spa_t *spa;
|
|
|
|
|
|
|
|
*config = NULL;
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_open_common(name, &spa, FTAG, NULL, config);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (spa != NULL) {
|
|
|
|
/*
|
|
|
|
* This still leaves a window of inconsistency where the spares
|
|
|
|
* or l2cache devices could change and the config would be
|
|
|
|
* self-inconsistent.
|
|
|
|
*/
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (*config != NULL) {
|
2010-08-27 01:24:34 +04:00
|
|
|
uint64_t loadtimes[2];
|
|
|
|
|
|
|
|
loadtimes[0] = spa->spa_loaded_ts.tv_sec;
|
|
|
|
loadtimes[1] = spa->spa_loaded_ts.tv_nsec;
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_uint64_array(*config,
|
|
|
|
ZPOOL_CONFIG_LOADED_TIME, loadtimes, 2);
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_uint64(*config,
|
2009-07-03 02:44:48 +04:00
|
|
|
ZPOOL_CONFIG_ERRCOUNT,
|
2021-10-27 02:15:38 +03:00
|
|
|
spa_get_errlog_size(spa));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2018-03-15 20:56:55 +03:00
|
|
|
if (spa_suspended(spa)) {
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_uint64(*config,
|
2009-07-03 02:44:48 +04:00
|
|
|
ZPOOL_CONFIG_SUSPENDED,
|
2021-10-27 02:15:38 +03:00
|
|
|
spa->spa_failmode);
|
|
|
|
fnvlist_add_uint64(*config,
|
2018-03-15 20:56:55 +03:00
|
|
|
ZPOOL_CONFIG_SUSPENDED_REASON,
|
2021-10-27 02:15:38 +03:00
|
|
|
spa->spa_suspended);
|
2018-03-15 20:56:55 +03:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_add_spares(spa, *config);
|
|
|
|
spa_add_l2cache(spa, *config);
|
2012-12-14 03:24:15 +04:00
|
|
|
spa_add_feature_stats(spa, *config);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We want to get the alternate root even for faulted pools, so we cheat
|
|
|
|
* and call spa_lookup() directly.
|
|
|
|
*/
|
|
|
|
if (altroot) {
|
|
|
|
if (spa == NULL) {
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa = spa_lookup(name);
|
|
|
|
if (spa)
|
|
|
|
spa_altroot(spa, altroot, buflen);
|
|
|
|
else
|
|
|
|
altroot[0] = '\0';
|
|
|
|
spa = NULL;
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
} else {
|
|
|
|
spa_altroot(spa, altroot, buflen);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (spa != NULL) {
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_close(spa, FTAG);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Validate that the auxiliary device array is well formed. We must have an
|
|
|
|
* array of nvlists, each which describes a valid leaf vdev. If this is an
|
|
|
|
* import (mode is VDEV_ALLOC_SPARE), then we allow corrupted spares to be
|
|
|
|
* specified, as long as they are well-formed.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
spa_validate_aux_devs(spa_t *spa, nvlist_t *nvroot, uint64_t crtxg, int mode,
|
|
|
|
spa_aux_vdev_t *sav, const char *config, uint64_t version,
|
|
|
|
vdev_labeltype_t label)
|
|
|
|
{
|
|
|
|
nvlist_t **dev;
|
|
|
|
uint_t i, ndev;
|
|
|
|
vdev_t *vd;
|
|
|
|
int error;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* It's acceptable to have no devs specified.
|
|
|
|
*/
|
|
|
|
if (nvlist_lookup_nvlist_array(nvroot, config, &dev, &ndev) != 0)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (ndev == 0)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINVAL));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure the pool is formatted with a version that supports this
|
|
|
|
* device type.
|
|
|
|
*/
|
|
|
|
if (spa_version(spa) < version)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOTSUP));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Set the pending device list so we correctly handle device in-use
|
|
|
|
* checking.
|
|
|
|
*/
|
|
|
|
sav->sav_pending = dev;
|
|
|
|
sav->sav_npending = ndev;
|
|
|
|
|
|
|
|
for (i = 0; i < ndev; i++) {
|
|
|
|
if ((error = spa_config_parse(spa, &vd, dev[i], NULL, 0,
|
|
|
|
mode)) != 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (!vd->vdev_ops->vdev_op_leaf) {
|
|
|
|
vdev_free(vd);
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2008-11-20 23:01:55 +03:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
vd->vdev_top = vd;
|
|
|
|
|
|
|
|
if ((error = vdev_open(vd)) == 0 &&
|
|
|
|
(error = vdev_label_init(vd, crtxg, label)) == 0) {
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_uint64(dev[i], ZPOOL_CONFIG_GUID,
|
|
|
|
vd->vdev_guid);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
vdev_free(vd);
|
|
|
|
|
|
|
|
if (error &&
|
|
|
|
(mode != VDEV_ALLOC_SPARE && mode != VDEV_ALLOC_L2CACHE))
|
|
|
|
goto out;
|
|
|
|
else
|
|
|
|
error = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
sav->sav_pending = NULL;
|
|
|
|
sav->sav_npending = 0;
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_validate_aux(spa_t *spa, nvlist_t *nvroot, uint64_t crtxg, int mode)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if ((error = spa_validate_aux_devs(spa, nvroot, crtxg, mode,
|
|
|
|
&spa->spa_spares, ZPOOL_CONFIG_SPARES, SPA_VERSION_SPARES,
|
|
|
|
VDEV_LABEL_SPARE)) != 0) {
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (spa_validate_aux_devs(spa, nvroot, crtxg, mode,
|
|
|
|
&spa->spa_l2cache, ZPOOL_CONFIG_L2CACHE, SPA_VERSION_L2CACHE,
|
|
|
|
VDEV_LABEL_L2CACHE));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_set_aux_vdevs(spa_aux_vdev_t *sav, nvlist_t **devs, int ndevs,
|
|
|
|
const char *config)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (sav->sav_config != NULL) {
|
|
|
|
nvlist_t **olddevs;
|
|
|
|
uint_t oldndevs;
|
|
|
|
nvlist_t **newdevs;
|
|
|
|
|
|
|
|
/*
|
2017-01-03 20:31:18 +03:00
|
|
|
* Generate new dev list by concatenating with the
|
2008-11-20 23:01:55 +03:00
|
|
|
* current dev list.
|
|
|
|
*/
|
2021-10-27 02:15:38 +03:00
|
|
|
VERIFY0(nvlist_lookup_nvlist_array(sav->sav_config, config,
|
|
|
|
&olddevs, &oldndevs));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
newdevs = kmem_alloc(sizeof (void *) *
|
2014-11-21 03:09:39 +03:00
|
|
|
(ndevs + oldndevs), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < oldndevs; i++)
|
2021-10-27 02:15:38 +03:00
|
|
|
newdevs[i] = fnvlist_dup(olddevs[i]);
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < ndevs; i++)
|
2021-10-27 02:15:38 +03:00
|
|
|
newdevs[i + oldndevs] = fnvlist_dup(devs[i]);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_remove(sav->sav_config, config);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(sav->sav_config, config,
|
|
|
|
(const nvlist_t * const *)newdevs, ndevs + oldndevs);
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < oldndevs + ndevs; i++)
|
|
|
|
nvlist_free(newdevs[i]);
|
|
|
|
kmem_free(newdevs, (oldndevs + ndevs) * sizeof (void *));
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Generate a new dev list.
|
|
|
|
*/
|
2021-10-27 02:15:38 +03:00
|
|
|
sav->sav_config = fnvlist_alloc();
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(sav->sav_config, config,
|
|
|
|
(const nvlist_t * const *)devs, ndevs);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Stop and drop level 2 ARC devices
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_l2cache_drop(spa_t *spa)
|
|
|
|
{
|
|
|
|
vdev_t *vd;
|
|
|
|
int i;
|
|
|
|
spa_aux_vdev_t *sav = &spa->spa_l2cache;
|
|
|
|
|
|
|
|
for (i = 0; i < sav->sav_count; i++) {
|
|
|
|
uint64_t pool;
|
|
|
|
|
|
|
|
vd = sav->sav_vdevs[i];
|
|
|
|
ASSERT(vd != NULL);
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
if (spa_l2cache_exists(vd->vdev_guid, &pool) &&
|
|
|
|
pool != 0ULL && l2arc_vdev_present(vd))
|
2008-11-20 23:01:55 +03:00
|
|
|
l2arc_remove_vdev(vd);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* Verify encryption parameters for spa creation. If we are encrypting, we must
|
|
|
|
* have the encryption feature flag enabled.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
spa_create_check_encryption_params(dsl_crypto_params_t *dcp,
|
|
|
|
boolean_t has_encryption)
|
|
|
|
{
|
|
|
|
if (dcp->cp_crypt != ZIO_CRYPT_OFF &&
|
|
|
|
dcp->cp_crypt != ZIO_CRYPT_INHERIT &&
|
|
|
|
!has_encryption)
|
|
|
|
return (SET_ERROR(ENOTSUP));
|
|
|
|
|
2018-06-18 22:47:12 +03:00
|
|
|
return (dmu_objset_create_crypt_check(NULL, dcp, NULL));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Pool Creation
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
spa_create(const char *pool, nvlist_t *nvroot, nvlist_t *props,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
nvlist_t *zplprops, dsl_crypto_params_t *dcp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
char *altroot = NULL;
|
|
|
|
vdev_t *rvd;
|
|
|
|
dsl_pool_t *dp;
|
|
|
|
dmu_tx_t *tx;
|
2009-07-03 02:44:48 +04:00
|
|
|
int error = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t txg = TXG_INITIAL;
|
|
|
|
nvlist_t **spares, **l2cache;
|
|
|
|
uint_t nspares, nl2cache;
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
uint64_t version, obj, ndraid = 0;
|
2012-12-14 03:24:15 +04:00
|
|
|
boolean_t has_features;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
boolean_t has_encryption;
|
2019-10-11 02:39:41 +03:00
|
|
|
boolean_t has_allocclass;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
spa_feature_t feat;
|
|
|
|
char *feat_name;
|
2014-06-21 03:00:11 +04:00
|
|
|
char *poolname;
|
|
|
|
nvlist_t *nvl;
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
if (props == NULL ||
|
|
|
|
nvlist_lookup_string(props, "tname", &poolname) != 0)
|
2014-06-21 03:00:11 +04:00
|
|
|
poolname = (char *)pool;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this pool already exists, return failure.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
2014-06-21 03:00:11 +04:00
|
|
|
if (spa_lookup(poolname) != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EEXIST));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate a new spa_t structure.
|
|
|
|
*/
|
2014-06-21 03:00:11 +04:00
|
|
|
nvl = fnvlist_alloc();
|
|
|
|
fnvlist_add_string(nvl, ZPOOL_CONFIG_POOL_NAME, pool);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) nvlist_lookup_string(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
|
2014-06-21 03:00:11 +04:00
|
|
|
spa = spa_add(poolname, nvl, altroot);
|
|
|
|
fnvlist_free(nvl);
|
2009-01-16 00:59:39 +03:00
|
|
|
spa_activate(spa, spa_mode_global);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (props && (error = spa_prop_validate(spa, props))) {
|
|
|
|
spa_deactivate(spa);
|
|
|
|
spa_remove(spa);
|
2008-12-03 23:09:06 +03:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2014-06-21 03:00:11 +04:00
|
|
|
/*
|
|
|
|
* Temporary pool names should never be written to disk.
|
|
|
|
*/
|
|
|
|
if (poolname != pool)
|
|
|
|
spa->spa_import_flags |= ZFS_IMPORT_TEMP_NAME;
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
has_features = B_FALSE;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
has_encryption = B_FALSE;
|
2019-10-11 02:39:41 +03:00
|
|
|
has_allocclass = B_FALSE;
|
2017-11-04 23:25:13 +03:00
|
|
|
for (nvpair_t *elem = nvlist_next_nvpair(props, NULL);
|
2012-12-14 03:24:15 +04:00
|
|
|
elem != NULL; elem = nvlist_next_nvpair(props, elem)) {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (zpool_prop_feature(nvpair_name(elem))) {
|
2012-12-14 03:24:15 +04:00
|
|
|
has_features = B_TRUE;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
feat_name = strchr(nvpair_name(elem), '@') + 1;
|
|
|
|
VERIFY0(zfeature_lookup_name(feat_name, &feat));
|
|
|
|
if (feat == SPA_FEATURE_ENCRYPTION)
|
|
|
|
has_encryption = B_TRUE;
|
2019-10-11 02:39:41 +03:00
|
|
|
if (feat == SPA_FEATURE_ALLOCATION_CLASSES)
|
|
|
|
has_allocclass = B_TRUE;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* verify encryption params, if they were provided */
|
|
|
|
if (dcp != NULL) {
|
|
|
|
error = spa_create_check_encryption_params(dcp, has_encryption);
|
|
|
|
if (error != 0) {
|
|
|
|
spa_deactivate(spa);
|
|
|
|
spa_remove(spa);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
return (error);
|
|
|
|
}
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
2020-01-07 02:40:06 +03:00
|
|
|
if (!has_allocclass && zfs_special_devs(nvroot, NULL)) {
|
2019-10-11 02:39:41 +03:00
|
|
|
spa_deactivate(spa);
|
|
|
|
spa_remove(spa);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
return (ENOTSUP);
|
|
|
|
}
|
2012-12-14 03:24:15 +04:00
|
|
|
|
|
|
|
if (has_features || nvlist_lookup_uint64(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_VERSION), &version) != 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
version = SPA_VERSION;
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
ASSERT(SPA_VERSION_IS_SUPPORTED(version));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa->spa_first_txg = txg;
|
|
|
|
spa->spa_uberblock.ub_txg = txg - 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_uberblock.ub_version = version;
|
|
|
|
spa->spa_ubsync = spa->spa_uberblock;
|
2016-10-14 03:59:18 +03:00
|
|
|
spa->spa_load_state = SPA_LOAD_CREATE;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa->spa_removing_phys.sr_state = DSS_NONE;
|
|
|
|
spa->spa_removing_phys.sr_removing_vdev = -1;
|
|
|
|
spa->spa_removing_phys.sr_prev_indirect_vdev = -1;
|
2019-04-05 20:12:02 +03:00
|
|
|
spa->spa_indirect_vdevs_loaded = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Create "The Godfather" zio to hold all async IOs
|
|
|
|
*/
|
2014-09-17 10:59:43 +04:00
|
|
|
spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
|
|
|
|
KM_SLEEP);
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int i = 0; i < max_ncpus; i++) {
|
2014-09-17 10:59:43 +04:00
|
|
|
spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
|
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
|
|
|
|
ZIO_FLAG_GODFATHER);
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Create the root vdev.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
error = spa_config_parse(spa, &rvd, nvroot, NULL, 0, VDEV_ALLOC_ADD);
|
|
|
|
|
|
|
|
ASSERT(error != 0 || rvd != NULL);
|
|
|
|
ASSERT(error != 0 || spa->spa_root_vdev == rvd);
|
|
|
|
|
|
|
|
if (error == 0 && !zfs_allocatable_devs(nvroot))
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (error == 0 &&
|
|
|
|
(error = vdev_create(rvd, txg, B_FALSE)) == 0 &&
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
(error = vdev_draid_spare_create(nvroot, rvd, &ndraid, 0)) == 0 &&
|
|
|
|
(error = spa_validate_aux(spa, nvroot, txg, VDEV_ALLOC_ADD)) == 0) {
|
2018-09-06 04:33:36 +03:00
|
|
|
/*
|
|
|
|
* instantiate the metaslab groups (this will dirty the vdevs)
|
|
|
|
* we can no longer error exit past this point
|
|
|
|
*/
|
|
|
|
for (int c = 0; error == 0 && c < rvd->vdev_children; c++) {
|
|
|
|
vdev_t *vd = rvd->vdev_child[c];
|
|
|
|
|
|
|
|
vdev_metaslab_set_size(vd);
|
|
|
|
vdev_expand(vd, txg);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (error != 0) {
|
|
|
|
spa_unload(spa);
|
|
|
|
spa_deactivate(spa);
|
|
|
|
spa_remove(spa);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get the list of spares, if specified.
|
|
|
|
*/
|
|
|
|
if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
|
|
|
|
&spares, &nspares) == 0) {
|
2021-10-27 02:15:38 +03:00
|
|
|
spa->spa_spares.sav_config = fnvlist_alloc();
|
|
|
|
fnvlist_add_nvlist_array(spa->spa_spares.sav_config,
|
2021-12-07 04:19:13 +03:00
|
|
|
ZPOOL_CONFIG_SPARES, (const nvlist_t * const *)spares,
|
|
|
|
nspares);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_load_spares(spa);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_spares.sav_sync = B_TRUE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get the list of level 2 cache devices, if specified.
|
|
|
|
*/
|
|
|
|
if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE,
|
|
|
|
&l2cache, &nl2cache) == 0) {
|
2021-12-07 04:19:13 +03:00
|
|
|
VERIFY0(nvlist_alloc(&spa->spa_l2cache.sav_config,
|
|
|
|
NV_UNIQUE_NAME, KM_SLEEP));
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_nvlist_array(spa->spa_l2cache.sav_config,
|
2021-12-07 04:19:13 +03:00
|
|
|
ZPOOL_CONFIG_L2CACHE, (const nvlist_t * const *)l2cache,
|
|
|
|
nl2cache);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_load_l2cache(spa);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_l2cache.sav_sync = B_TRUE;
|
|
|
|
}
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
spa->spa_is_initializing = B_TRUE;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
spa->spa_dsl_pool = dp = dsl_pool_create(spa, zplprops, dcp, txg);
|
2012-12-14 03:24:15 +04:00
|
|
|
spa->spa_is_initializing = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Create DDTs (dedup tables).
|
|
|
|
*/
|
|
|
|
ddt_create(spa);
|
|
|
|
|
|
|
|
spa_update_dspace(spa);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
tx = dmu_tx_create_assigned(dp, txg);
|
|
|
|
|
2017-10-23 19:45:59 +03:00
|
|
|
/*
|
|
|
|
* Create the pool's history object.
|
|
|
|
*/
|
|
|
|
if (version >= SPA_VERSION_ZPOOL_HISTORY && !spa->spa_history)
|
|
|
|
spa_history_create_obj(spa, tx);
|
|
|
|
|
|
|
|
spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_CREATE);
|
|
|
|
spa_history_log_version(spa, "create", tx);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Create the pool config object.
|
|
|
|
*/
|
|
|
|
spa->spa_config_object = dmu_object_alloc(spa->spa_meta_objset,
|
2008-12-03 23:09:06 +03:00
|
|
|
DMU_OT_PACKED_NVLIST, SPA_CONFIG_BLOCKSIZE,
|
2008-11-20 23:01:55 +03:00
|
|
|
DMU_OT_PACKED_NVLIST_SIZE, sizeof (uint64_t), tx);
|
|
|
|
|
|
|
|
if (zap_add(spa->spa_meta_objset,
|
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CONFIG,
|
|
|
|
sizeof (uint64_t), 1, &spa->spa_config_object, tx) != 0) {
|
|
|
|
cmn_err(CE_PANIC, "failed to add pool config");
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zap_add(spa->spa_meta_objset,
|
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CREATION_VERSION,
|
|
|
|
sizeof (uint64_t), 1, &version, tx) != 0) {
|
|
|
|
cmn_err(CE_PANIC, "failed to add pool version");
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/* Newly created pools with the right version are always deflated. */
|
|
|
|
if (version >= SPA_VERSION_RAIDZ_DEFLATE) {
|
|
|
|
spa->spa_deflate = TRUE;
|
|
|
|
if (zap_add(spa->spa_meta_objset,
|
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_DEFLATE,
|
|
|
|
sizeof (uint64_t), 1, &spa->spa_deflate, tx) != 0) {
|
|
|
|
cmn_err(CE_PANIC, "failed to add deflate");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Create the deferred-free bpobj. Turn off compression
|
2008-11-20 23:01:55 +03:00
|
|
|
* because sync-to-convergence takes longer if the blocksize
|
|
|
|
* keeps changing.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
obj = bpobj_alloc(spa->spa_meta_objset, 1 << 14, tx);
|
|
|
|
dmu_object_set_compress(spa->spa_meta_objset, obj,
|
2008-11-20 23:01:55 +03:00
|
|
|
ZIO_COMPRESS_OFF, tx);
|
|
|
|
if (zap_add(spa->spa_meta_objset,
|
2010-05-29 00:45:14 +04:00
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_SYNC_BPOBJ,
|
|
|
|
sizeof (uint64_t), 1, &obj, tx) != 0) {
|
|
|
|
cmn_err(CE_PANIC, "failed to add bpobj");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(0, ==, bpobj_open(&spa->spa_deferred_bpobj,
|
|
|
|
spa->spa_meta_objset, obj));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-16 01:47:05 +03:00
|
|
|
/*
|
|
|
|
* Generate some random noise for salted checksums to operate on.
|
|
|
|
*/
|
|
|
|
(void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
|
|
|
|
sizeof (spa->spa_cksum_salt.zcs_bytes));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Set pool properties.
|
|
|
|
*/
|
|
|
|
spa->spa_bootfs = zpool_prop_default_numeric(ZPOOL_PROP_BOOTFS);
|
|
|
|
spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
|
|
|
|
spa->spa_failmode = zpool_prop_default_numeric(ZPOOL_PROP_FAILUREMODE);
|
2009-07-03 02:44:48 +04:00
|
|
|
spa->spa_autoexpand = zpool_prop_default_numeric(ZPOOL_PROP_AUTOEXPAND);
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
spa->spa_multihost = zpool_prop_default_numeric(ZPOOL_PROP_MULTIHOST);
|
2019-03-29 19:13:20 +03:00
|
|
|
spa->spa_autotrim = zpool_prop_default_numeric(ZPOOL_PROP_AUTOTRIM);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
if (props != NULL) {
|
|
|
|
spa_configfile_set(spa, props, B_FALSE);
|
2013-09-04 16:00:57 +04:00
|
|
|
spa_sync_props(props, tx);
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
for (int i = 0; i < ndraid; i++)
|
|
|
|
spa_feature_incr(spa, SPA_FEATURE_DRAID, tx);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
|
|
|
spa->spa_sync_on = B_TRUE;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
txg_sync_start(dp);
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
mmp_thread_start(spa);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
txg_wait_synced(dp, txg);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in
spa_condense_indirect_thread(), so it calls the
spa_condense_indirect_complete_sync() sync task which sets
the spa_condensing_indirect field to NULL. Waiting for the
sync task to finish, thread A sleeps until the txg is done.
When this happens, thread A will acquire spa_async_lock and
set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is
running spa_sync() checks whether it should condense the
second vdev in vdev_indirect_should_condense() by checking the
spa_condensing_indirect field which was set to NULL by
spa_condense_indirect_thread() from thread A. So it goes on
and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned
assertions fails because thread A has not set spa_condense_thread
to NULL (which is basically the last thing it does before returning).
The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.
This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.
In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.
The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
currently condensing or not
- condensing is more decoupled from the spa_async_thread related
functionality.
As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2017-03-16 02:41:52 +03:00
|
|
|
spa_spawn_aux_threads(spa);
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_write_cachefile(spa, B_FALSE, B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
/*
|
|
|
|
* Don't count references from objsets that are already closed
|
|
|
|
* and are making their way through the eviction process.
|
|
|
|
*/
|
|
|
|
spa_evicting_os_wait(spa);
|
2018-10-01 20:42:05 +03:00
|
|
|
spa->spa_minref = zfs_refcount_count(&spa->spa_refcount);
|
2016-10-14 03:59:18 +03:00
|
|
|
spa->spa_load_state = SPA_LOAD_NONE;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2022-02-16 02:54:25 +03:00
|
|
|
spa_import_os(spa);
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Import a non-root pool into the system.
|
|
|
|
*/
|
|
|
|
int
|
2013-09-04 16:00:57 +04:00
|
|
|
spa_import(char *pool, nvlist_t *config, nvlist_t *props, uint64_t flags)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
char *altroot = NULL;
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_load_state_t state = SPA_LOAD_IMPORT;
|
2017-02-11 01:51:09 +03:00
|
|
|
zpool_load_policy_t policy;
|
2019-11-21 20:32:57 +03:00
|
|
|
spa_mode_t mode = spa_mode_global;
|
2010-08-27 01:24:34 +04:00
|
|
|
uint64_t readonly = B_FALSE;
|
2009-07-03 02:44:48 +04:00
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
nvlist_t *nvroot;
|
|
|
|
nvlist_t **spares, **l2cache;
|
|
|
|
uint_t nspares, nl2cache;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If a pool with this name exists, return failure.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (spa_lookup(pool) != NULL) {
|
2009-07-03 02:44:48 +04:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EEXIST));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create and initialize the spa structure.
|
|
|
|
*/
|
|
|
|
(void) nvlist_lookup_string(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
|
2010-08-27 01:24:34 +04:00
|
|
|
(void) nvlist_lookup_uint64(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_READONLY), &readonly);
|
|
|
|
if (readonly)
|
2019-11-21 20:32:57 +03:00
|
|
|
mode = SPA_MODE_READ;
|
2010-05-29 00:45:14 +04:00
|
|
|
spa = spa_add(pool, config, altroot);
|
2010-08-27 01:24:34 +04:00
|
|
|
spa->spa_import_flags = flags;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verbatim import - Take a pool and insert it into the namespace
|
|
|
|
* as if it had been loaded at boot.
|
|
|
|
*/
|
|
|
|
if (spa->spa_import_flags & ZFS_IMPORT_VERBATIM) {
|
|
|
|
if (props != NULL)
|
|
|
|
spa_configfile_set(spa, props, B_FALSE);
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_write_cachefile(spa, B_FALSE, B_TRUE);
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
|
2016-03-10 18:16:02 +03:00
|
|
|
zfs_dbgmsg("spa_import: verbatim import of %s", pool);
|
2010-08-27 01:24:34 +04:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
spa_activate(spa, mode);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Don't start async tasks until we know everything is healthy.
|
|
|
|
*/
|
|
|
|
spa_async_suspend(spa);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2017-02-11 01:51:09 +03:00
|
|
|
zpool_get_load_policy(config, &policy);
|
|
|
|
if (policy.zlp_rewind & ZPOOL_DO_REWIND)
|
2010-08-27 01:24:34 +04:00
|
|
|
state = SPA_LOAD_RECOVER;
|
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa->spa_config_source = SPA_CONFIG_SRC_TRYIMPORT;
|
2010-08-27 01:24:34 +04:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (state != SPA_LOAD_RECOVER) {
|
|
|
|
spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
|
|
|
|
zfs_dbgmsg("spa_import: importing %s", pool);
|
|
|
|
} else {
|
|
|
|
zfs_dbgmsg("spa_import: importing %s, max_txg=%lld "
|
2017-02-11 01:51:09 +03:00
|
|
|
"(RECOVERY MODE)", pool, (longlong_t)policy.zlp_txg);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
}
|
2017-02-11 01:51:09 +03:00
|
|
|
error = spa_load_best(spa, state, policy.zlp_txg, policy.zlp_rewind);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
2010-08-27 01:24:34 +04:00
|
|
|
* Propagate anything learned while loading the pool and pass it
|
|
|
|
* back to caller (i.e. rewind info, missing devices, etc).
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO, spa->spa_load_info);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2009-07-03 02:44:48 +04:00
|
|
|
* Toss any existing sparelist, as it doesn't have any validity
|
|
|
|
* anymore, and conflicts with spa_has_spare().
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-07-03 02:44:48 +04:00
|
|
|
if (spa->spa_spares.sav_config) {
|
2008-11-20 23:01:55 +03:00
|
|
|
nvlist_free(spa->spa_spares.sav_config);
|
|
|
|
spa->spa_spares.sav_config = NULL;
|
|
|
|
spa_load_spares(spa);
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
if (spa->spa_l2cache.sav_config) {
|
2008-11-20 23:01:55 +03:00
|
|
|
nvlist_free(spa->spa_l2cache.sav_config);
|
|
|
|
spa->spa_l2cache.sav_config = NULL;
|
|
|
|
spa_load_l2cache(spa);
|
|
|
|
}
|
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
nvroot = fnvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
if (props != NULL)
|
|
|
|
spa_configfile_set(spa, props, B_FALSE);
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
if (error != 0 || (props && spa_writeable(spa) &&
|
|
|
|
(error = spa_prop_set(spa, props)))) {
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_unload(spa);
|
|
|
|
spa_deactivate(spa);
|
|
|
|
spa_remove(spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
spa_async_resume(spa);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Override any spares and level 2 cache devices as specified by
|
|
|
|
* the user, as these may have correct device names/devids, etc.
|
|
|
|
*/
|
|
|
|
if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
|
|
|
|
&spares, &nspares) == 0) {
|
|
|
|
if (spa->spa_spares.sav_config)
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_remove(spa->spa_spares.sav_config,
|
|
|
|
ZPOOL_CONFIG_SPARES);
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
2021-10-27 02:15:38 +03:00
|
|
|
spa->spa_spares.sav_config = fnvlist_alloc();
|
|
|
|
fnvlist_add_nvlist_array(spa->spa_spares.sav_config,
|
2021-12-07 04:19:13 +03:00
|
|
|
ZPOOL_CONFIG_SPARES, (const nvlist_t * const *)spares,
|
|
|
|
nspares);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_load_spares(spa);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_spares.sav_sync = B_TRUE;
|
|
|
|
}
|
|
|
|
if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE,
|
|
|
|
&l2cache, &nl2cache) == 0) {
|
|
|
|
if (spa->spa_l2cache.sav_config)
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_remove(spa->spa_l2cache.sav_config,
|
|
|
|
ZPOOL_CONFIG_L2CACHE);
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
2021-10-27 02:15:38 +03:00
|
|
|
spa->spa_l2cache.sav_config = fnvlist_alloc();
|
|
|
|
fnvlist_add_nvlist_array(spa->spa_l2cache.sav_config,
|
2021-12-07 04:19:13 +03:00
|
|
|
ZPOOL_CONFIG_L2CACHE, (const nvlist_t * const *)l2cache,
|
|
|
|
nl2cache);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_load_l2cache(spa);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_l2cache.sav_sync = B_TRUE;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Check for any removed devices.
|
|
|
|
*/
|
|
|
|
if (spa->spa_autoreplace) {
|
|
|
|
spa_aux_check_removed(&spa->spa_spares);
|
|
|
|
spa_aux_check_removed(&spa->spa_l2cache);
|
|
|
|
}
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
if (spa_writeable(spa)) {
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Update the config cache to include the newly-imported pool.
|
|
|
|
*/
|
2009-08-18 22:43:27 +04:00
|
|
|
spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2009-07-03 02:44:48 +04:00
|
|
|
* It's possible that the pool was expanded while it was exported.
|
|
|
|
* We kick off an async task to handle this for us.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_async_request(spa, SPA_ASYNC_AUTOEXPAND);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2017-10-23 19:45:59 +03:00
|
|
|
spa_history_log_version(spa, "import", NULL);
|
2016-07-28 01:29:15 +03:00
|
|
|
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
|
2016-07-28 01:29:15 +03:00
|
|
|
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
async zvol minor node creation interferes with receive
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE). This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...). These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.
After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream). This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:
1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset. This
causes the 2nd receive to fail.
2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl). This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.
3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.
To address these problems, this change synchronously creates the minor
node. To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.
Implementation notes:
We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys. The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.
Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set. We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.
There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems. E.g. changing the volmode or snapdev
properties. These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations. In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.
The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem. However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes #7863
Closes #9885
2020-02-03 20:33:14 +03:00
|
|
|
zvol_create_minors_recursive(pool);
|
2019-10-28 19:51:53 +03:00
|
|
|
|
2022-02-16 02:54:25 +03:00
|
|
|
spa_import_os(spa);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
nvlist_t *
|
|
|
|
spa_tryimport(nvlist_t *tryconfig)
|
|
|
|
{
|
|
|
|
nvlist_t *config = NULL;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
char *poolname, *cachefile;
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_t *spa;
|
|
|
|
uint64_t state;
|
2009-02-18 23:51:31 +03:00
|
|
|
int error;
|
2017-02-11 01:51:09 +03:00
|
|
|
zpool_load_policy_t policy;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_POOL_NAME, &poolname))
|
|
|
|
return (NULL);
|
|
|
|
|
|
|
|
if (nvlist_lookup_uint64(tryconfig, ZPOOL_CONFIG_POOL_STATE, &state))
|
|
|
|
return (NULL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create and initialize the spa structure.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
spa = spa_add(TRYIMPORT_NAME, tryconfig, NULL);
|
2019-11-21 20:32:57 +03:00
|
|
|
spa_activate(spa, SPA_MODE_READ);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2017-02-11 01:51:09 +03:00
|
|
|
* Rewind pool if a max txg was provided.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2017-02-11 01:51:09 +03:00
|
|
|
zpool_get_load_policy(spa->spa_config, &policy);
|
|
|
|
if (policy.zlp_txg != UINT64_MAX) {
|
|
|
|
spa->spa_load_max_txg = policy.zlp_txg;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
spa->spa_extreme_rewind = B_TRUE;
|
|
|
|
zfs_dbgmsg("spa_tryimport: importing %s, max_txg=%lld",
|
2017-02-11 01:51:09 +03:00
|
|
|
poolname, (longlong_t)policy.zlp_txg);
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
} else {
|
|
|
|
zfs_dbgmsg("spa_tryimport: importing %s", poolname);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_CACHEFILE, &cachefile)
|
|
|
|
== 0) {
|
|
|
|
zfs_dbgmsg("spa_tryimport: using cachefile '%s'", cachefile);
|
|
|
|
spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE;
|
|
|
|
} else {
|
|
|
|
spa->spa_config_source = SPA_CONFIG_SRC_SCAN;
|
|
|
|
}
|
|
|
|
|
|
|
|
error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If 'tryconfig' was at least parsable, return the current config.
|
|
|
|
*/
|
|
|
|
if (spa->spa_root_vdev != NULL) {
|
|
|
|
config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME, poolname);
|
|
|
|
fnvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE, state);
|
|
|
|
fnvlist_add_uint64(config, ZPOOL_CONFIG_TIMESTAMP,
|
|
|
|
spa->spa_uberblock.ub_timestamp);
|
|
|
|
fnvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
|
|
|
|
spa->spa_load_info);
|
|
|
|
fnvlist_add_uint64(config, ZPOOL_CONFIG_ERRATA,
|
|
|
|
spa->spa_errata);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the bootfs property exists on this pool then we
|
|
|
|
* copy it out so that external consumers can tell which
|
|
|
|
* pools are bootable.
|
|
|
|
*/
|
2009-02-18 23:51:31 +03:00
|
|
|
if ((!error || error == EEXIST) && spa->spa_bootfs) {
|
2014-11-21 03:09:39 +03:00
|
|
|
char *tmpname = kmem_alloc(MAXPATHLEN, KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We have to play games with the name since the
|
|
|
|
* pool was opened as TRYIMPORT_NAME.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (dsl_dsobj_to_dsname(spa_name(spa),
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_bootfs, tmpname) == 0) {
|
|
|
|
char *cp;
|
2013-11-01 23:26:11 +04:00
|
|
|
char *dsname;
|
|
|
|
|
2014-11-21 03:09:39 +03:00
|
|
|
dsname = kmem_alloc(MAXPATHLEN, KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
cp = strchr(tmpname, '/');
|
|
|
|
if (cp == NULL) {
|
|
|
|
(void) strlcpy(dsname, tmpname,
|
|
|
|
MAXPATHLEN);
|
|
|
|
} else {
|
|
|
|
(void) snprintf(dsname, MAXPATHLEN,
|
|
|
|
"%s/%s", poolname, ++cp);
|
|
|
|
}
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_string(config, ZPOOL_CONFIG_BOOTFS,
|
|
|
|
dsname);
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_free(dsname, MAXPATHLEN);
|
|
|
|
}
|
|
|
|
kmem_free(tmpname, MAXPATHLEN);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add the list of hot spares and level 2 cache devices.
|
|
|
|
*/
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_add_spares(spa, config);
|
|
|
|
spa_add_l2cache(spa, config);
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
spa_unload(spa);
|
|
|
|
spa_deactivate(spa);
|
|
|
|
spa_remove(spa);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
return (config);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pool export/destroy
|
|
|
|
*
|
|
|
|
* The act of destroying or exporting a pool is very simple. We make sure there
|
|
|
|
* is no more pending I/O and any references to the pool are gone. Then, we
|
|
|
|
* update the pool state and sync all the labels to disk, removing the
|
2009-01-16 00:59:39 +03:00
|
|
|
* configuration from the cache afterwards. If the 'hardforce' flag is set, then
|
|
|
|
* we don't sync the labels or remove the configuration cache.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static int
|
2020-10-03 03:44:10 +03:00
|
|
|
spa_export_common(const char *pool, int new_state, nvlist_t **oldconfig,
|
2009-01-16 00:59:39 +03:00
|
|
|
boolean_t force, boolean_t hardforce)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-01-26 02:04:11 +03:00
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_t *spa;
|
|
|
|
|
|
|
|
if (oldconfig)
|
|
|
|
*oldconfig = NULL;
|
|
|
|
|
2019-11-21 20:32:57 +03:00
|
|
|
if (!(spa_mode_global & SPA_MODE_WRITE))
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EROFS));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
if ((spa = spa_lookup(pool)) == NULL) {
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOENT));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2019-07-18 23:02:33 +03:00
|
|
|
if (spa->spa_is_exporting) {
|
|
|
|
/* the pool is being exported by another thread */
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
return (SET_ERROR(ZFS_ERR_EXPORT_IN_PROGRESS));
|
|
|
|
}
|
|
|
|
spa->spa_is_exporting = B_TRUE;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Put a hold on the pool, drop the namespace lock, stop async tasks,
|
|
|
|
* reacquire the namespace lock, and see if we can export.
|
|
|
|
*/
|
|
|
|
spa_open_ref(spa, FTAG);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
spa_async_suspend(spa);
|
2014-03-22 13:07:14 +04:00
|
|
|
if (spa->spa_zvol_taskq) {
|
|
|
|
zvol_remove_minors(spa, spa_name(spa), B_TRUE);
|
|
|
|
taskq_wait(spa->spa_zvol_taskq);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
2015-02-27 08:46:45 +03:00
|
|
|
if (spa->spa_state == POOL_STATE_UNINITIALIZED)
|
|
|
|
goto export_spa;
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2015-02-27 08:46:45 +03:00
|
|
|
* The pool will be in core if it's openable, in which case we can
|
|
|
|
* modify its state. Objsets may be open only because they're dirty,
|
|
|
|
* so we have to force it to sync before checking spa_refcnt.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-04-02 06:44:32 +03:00
|
|
|
if (spa->spa_sync_on) {
|
2008-11-20 23:01:55 +03:00
|
|
|
txg_wait_synced(spa->spa_dsl_pool, 0);
|
2015-04-02 06:44:32 +03:00
|
|
|
spa_evicting_os_wait(spa);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-02-27 08:46:45 +03:00
|
|
|
/*
|
|
|
|
* A pool cannot be exported or destroyed if there are active
|
|
|
|
* references. If we are resetting a pool, allow references by
|
|
|
|
* fault injection handlers.
|
|
|
|
*/
|
2021-01-26 02:04:11 +03:00
|
|
|
if (!spa_refcount_zero(spa) || (spa->spa_inject_ref != 0)) {
|
|
|
|
error = SET_ERROR(EBUSY);
|
|
|
|
goto fail;
|
2015-02-27 08:46:45 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-02-27 08:46:45 +03:00
|
|
|
if (spa->spa_sync_on) {
|
2022-09-28 02:45:51 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* A pool cannot be exported if it has an active shared spare.
|
|
|
|
* This is to prevent other pools stealing the active spare
|
|
|
|
* from an exported pool. At user's own will, such pool can
|
|
|
|
* be forcedly exported.
|
|
|
|
*/
|
|
|
|
if (!force && new_state == POOL_STATE_EXPORTED &&
|
|
|
|
spa_has_active_shared_spare(spa)) {
|
2021-01-26 02:04:11 +03:00
|
|
|
error = SET_ERROR(EXDEV);
|
|
|
|
goto fail;
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
/*
|
|
|
|
* We're about to export or destroy this pool. Make sure
|
2019-03-29 19:13:20 +03:00
|
|
|
* we stop all initialization and trim activity here before
|
|
|
|
* we set the spa_final_txg. This will ensure that all
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
* dirty data resulting from the initialization is
|
|
|
|
* committed to disk before we unload the pool.
|
|
|
|
*/
|
2022-09-28 02:45:51 +03:00
|
|
|
vdev_initialize_stop_all(rvd, VDEV_INITIALIZE_ACTIVE);
|
|
|
|
vdev_trim_stop_all(rvd, VDEV_TRIM_ACTIVE);
|
|
|
|
vdev_autotrim_stop_all(spa);
|
|
|
|
vdev_rebuild_stop_all(spa);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* We want this to be reflected on every label,
|
|
|
|
* so mark them all dirty. spa_unload() will do the
|
|
|
|
* final sync that pushes these changes out.
|
|
|
|
*/
|
2009-01-16 00:59:39 +03:00
|
|
|
if (new_state != POOL_STATE_UNINITIALIZED && !hardforce) {
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_state = new_state;
|
2022-09-28 02:45:51 +03:00
|
|
|
vdev_config_dirty(rvd);
|
2022-02-16 02:48:59 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the log space map feature is enabled and the pool is
|
|
|
|
* getting exported (but not destroyed), we want to spend some
|
|
|
|
* time flushing as many metaslabs as we can in an attempt to
|
|
|
|
* destroy log space maps and save import time. This has to be
|
|
|
|
* done before we set the spa_final_txg, otherwise
|
|
|
|
* spa_sync() -> spa_flush_metaslabs() may dirty the final TXGs.
|
|
|
|
* spa_should_flush_logs_on_unload() should be called after
|
|
|
|
* spa_state has been set to the new_state.
|
|
|
|
*/
|
|
|
|
if (spa_should_flush_logs_on_unload(spa))
|
|
|
|
spa_unload_log_sm_flush_all(spa);
|
|
|
|
|
|
|
|
if (new_state != POOL_STATE_UNINITIALIZED && !hardforce) {
|
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2010-05-29 00:45:14 +04:00
|
|
|
spa->spa_final_txg = spa_last_synced_txg(spa) +
|
|
|
|
TXG_DEFER_SIZE + 1;
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-02-27 08:46:45 +03:00
|
|
|
export_spa:
|
2022-02-16 02:54:25 +03:00
|
|
|
spa_export_os(spa);
|
|
|
|
|
2017-10-23 19:45:59 +03:00
|
|
|
if (new_state == POOL_STATE_DESTROYED)
|
|
|
|
spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_DESTROY);
|
|
|
|
else if (new_state == POOL_STATE_EXPORTED)
|
|
|
|
spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_EXPORT);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
|
|
|
|
spa_unload(spa);
|
|
|
|
spa_deactivate(spa);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (oldconfig && spa->spa_config)
|
2021-10-27 02:15:38 +03:00
|
|
|
*oldconfig = fnvlist_dup(spa->spa_config);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (new_state != POOL_STATE_UNINITIALIZED) {
|
2009-01-16 00:59:39 +03:00
|
|
|
if (!hardforce)
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_write_cachefile(spa, B_TRUE, B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_remove(spa);
|
2019-07-18 23:02:33 +03:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* If spa_remove() is not called for this spa_t and
|
|
|
|
* there is any possibility that it can be reused,
|
|
|
|
* we make sure to reset the exporting flag.
|
|
|
|
*/
|
|
|
|
spa->spa_is_exporting = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2019-07-18 23:02:33 +03:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
2021-01-26 02:04:11 +03:00
|
|
|
|
|
|
|
fail:
|
|
|
|
spa->spa_is_exporting = B_FALSE;
|
|
|
|
spa_async_resume(spa);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
return (error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Destroy a storage pool.
|
|
|
|
*/
|
|
|
|
int
|
2020-10-03 03:44:10 +03:00
|
|
|
spa_destroy(const char *pool)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
return (spa_export_common(pool, POOL_STATE_DESTROYED, NULL,
|
|
|
|
B_FALSE, B_FALSE));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Export a storage pool.
|
|
|
|
*/
|
|
|
|
int
|
2020-10-03 03:44:10 +03:00
|
|
|
spa_export(const char *pool, nvlist_t **oldconfig, boolean_t force,
|
2009-01-16 00:59:39 +03:00
|
|
|
boolean_t hardforce)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
return (spa_export_common(pool, POOL_STATE_EXPORTED, oldconfig,
|
|
|
|
force, hardforce));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Similar to spa_export(), this unloads the spa_t without actually removing it
|
|
|
|
* from the namespace in any way.
|
|
|
|
*/
|
|
|
|
int
|
2020-10-03 03:44:10 +03:00
|
|
|
spa_reset(const char *pool)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
return (spa_export_common(pool, POOL_STATE_UNINITIALIZED, NULL,
|
2009-01-16 00:59:39 +03:00
|
|
|
B_FALSE, B_FALSE));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* Device manipulation
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
/*
|
|
|
|
* This is called as a synctask to increment the draid feature flag
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_draid_feature_incr(void *arg, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
spa_t *spa = dmu_tx_pool(tx)->dp_spa;
|
|
|
|
int draid = (int)(uintptr_t)arg;
|
|
|
|
|
|
|
|
for (int c = 0; c < draid; c++)
|
|
|
|
spa_feature_incr(spa, SPA_FEATURE_DRAID, tx);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Add a device to a storage pool.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
spa_vdev_add(spa_t *spa, nvlist_t *nvroot)
|
|
|
|
{
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
uint64_t txg, ndraid = 0;
|
2009-01-16 00:59:39 +03:00
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
vdev_t *vd, *tvd;
|
|
|
|
nvlist_t **spares, **l2cache;
|
|
|
|
uint_t nspares, nl2cache;
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
ASSERT(spa_writeable(spa));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
txg = spa_vdev_enter(spa);
|
|
|
|
|
|
|
|
if ((error = spa_config_parse(spa, &vd, nvroot, NULL, 0,
|
|
|
|
VDEV_ALLOC_ADD)) != 0)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, error));
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa->spa_pending_vdev = vd; /* spa_vdev_exit() will clear this */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES, &spares,
|
|
|
|
&nspares) != 0)
|
|
|
|
nspares = 0;
|
|
|
|
|
|
|
|
if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE, &l2cache,
|
|
|
|
&nl2cache) != 0)
|
|
|
|
nl2cache = 0;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (vd->vdev_children == 0 && nspares == 0 && nl2cache == 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (spa_vdev_exit(spa, vd, txg, EINVAL));
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (vd->vdev_children != 0 &&
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
(error = vdev_create(vd, txg, B_FALSE)) != 0) {
|
2008-12-03 23:09:06 +03:00
|
|
|
return (spa_vdev_exit(spa, vd, txg, error));
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The virtual dRAID spares must be added after vdev tree is created
|
2021-04-03 04:38:53 +03:00
|
|
|
* and the vdev guids are generated. The guid of their associated
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
* dRAID is stored in the config and used when opening the spare.
|
|
|
|
*/
|
|
|
|
if ((error = vdev_draid_spare_create(nvroot, vd, &ndraid,
|
|
|
|
rvd->vdev_children)) == 0) {
|
|
|
|
if (ndraid > 0 && nvlist_lookup_nvlist_array(nvroot,
|
|
|
|
ZPOOL_CONFIG_SPARES, &spares, &nspares) != 0)
|
|
|
|
nspares = 0;
|
|
|
|
} else {
|
|
|
|
return (spa_vdev_exit(spa, vd, txg, error));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We must validate the spares and l2cache devices after checking the
|
|
|
|
* children. Otherwise, vdev_inuse() will blindly overwrite the spare.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if ((error = spa_validate_aux(spa, nvroot, txg, VDEV_ALLOC_ADD)) != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (spa_vdev_exit(spa, vd, txg, error));
|
|
|
|
|
|
|
|
/*
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
* If we are in the middle of a device removal, we can only add
|
|
|
|
* devices which match the existing devices in the pool.
|
|
|
|
* If we are in the middle of a removal, or have some indirect
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
* vdevs, we can not add raidz or dRAID top levels.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (spa->spa_vdev_removal != NULL ||
|
|
|
|
spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
|
|
|
|
for (int c = 0; c < vd->vdev_children; c++) {
|
|
|
|
tvd = vd->vdev_child[c];
|
|
|
|
if (spa->spa_vdev_removal != NULL &&
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
tvd->vdev_ashift != spa->spa_max_ashift) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
return (spa_vdev_exit(spa, vd, txg, EINVAL));
|
|
|
|
}
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
/* Fail if top level vdev is raidz or a dRAID */
|
|
|
|
if (vdev_get_nparity(tvd) != 0)
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
return (spa_vdev_exit(spa, vd, txg, EINVAL));
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
|
|
|
* Need the top level mirror to be
|
|
|
|
* a mirror of leaf vdevs only
|
|
|
|
*/
|
|
|
|
if (tvd->vdev_ops == &vdev_mirror_ops) {
|
|
|
|
for (uint64_t cid = 0;
|
|
|
|
cid < tvd->vdev_children; cid++) {
|
|
|
|
vdev_t *cvd = tvd->vdev_child[cid];
|
|
|
|
if (!cvd->vdev_ops->vdev_op_leaf) {
|
|
|
|
return (spa_vdev_exit(spa, vd,
|
|
|
|
txg, EINVAL));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < vd->vdev_children; c++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
tvd = vd->vdev_child[c];
|
|
|
|
vdev_remove_child(vd, tvd);
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
tvd->vdev_id = rvd->vdev_children;
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_add_child(rvd, tvd);
|
|
|
|
vdev_config_dirty(tvd);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (nspares != 0) {
|
|
|
|
spa_set_aux_vdevs(&spa->spa_spares, spares, nspares,
|
|
|
|
ZPOOL_CONFIG_SPARES);
|
|
|
|
spa_load_spares(spa);
|
|
|
|
spa->spa_spares.sav_sync = B_TRUE;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (nl2cache != 0) {
|
|
|
|
spa_set_aux_vdevs(&spa->spa_l2cache, l2cache, nl2cache,
|
|
|
|
ZPOOL_CONFIG_L2CACHE);
|
|
|
|
spa_load_l2cache(spa);
|
|
|
|
spa->spa_l2cache.sav_sync = B_TRUE;
|
|
|
|
}
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
/*
|
|
|
|
* We can't increment a feature while holding spa_vdev so we
|
|
|
|
* have to do it in a synctask.
|
|
|
|
*/
|
|
|
|
if (ndraid != 0) {
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
|
|
|
|
tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
|
|
|
|
dsl_sync_task_nowait(spa->spa_dsl_pool, spa_draid_feature_incr,
|
|
|
|
(void *)(uintptr_t)ndraid, tx);
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* We have to be careful when adding new vdevs to an existing pool.
|
|
|
|
* If other threads start allocating from these vdevs before we
|
|
|
|
* sync the config cache, and we lose power, then upon reboot we may
|
|
|
|
* fail to open the pool because there are DVAs that the config cache
|
|
|
|
* can't translate. Therefore, we first add the vdevs without
|
|
|
|
* initializing metaslabs; sync the config cache (via spa_vdev_exit());
|
|
|
|
* and then let spa_config_update() initialize the new metaslabs.
|
|
|
|
*
|
|
|
|
* spa_load() checks for added-but-not-initialized vdevs, so that
|
|
|
|
* if we lose power at any point in this sequence, the remaining
|
|
|
|
* steps will be completed the next time we load the pool.
|
|
|
|
*/
|
|
|
|
(void) spa_vdev_exit(spa, vd, txg, 0);
|
|
|
|
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(spa, NULL, NULL, ESC_ZFS_VDEV_ADD);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Attach a device to a mirror. The arguments are the path to any device
|
|
|
|
* in the mirror, and the nvroot for the new device. If the path specifies
|
|
|
|
* a device that is not mirrored, we automatically insert the mirror vdev.
|
|
|
|
*
|
|
|
|
* If 'replacing' is specified, the new device is intended to replace the
|
|
|
|
* existing device; in this case the two devices are made into their own
|
|
|
|
* mirror using the 'replacing' vdev, which is functionally identical to
|
|
|
|
* the mirror vdev (it actually reuses all the same ops) but has a few
|
|
|
|
* extra rules: you can't attach to it after it's been created, and upon
|
|
|
|
* completion of resilvering, the first disk (the one being replaced)
|
|
|
|
* is automatically detached.
|
2020-07-03 21:05:50 +03:00
|
|
|
*
|
|
|
|
* If 'rebuild' is specified, then sequential reconstruction (a.ka. rebuild)
|
|
|
|
* should be performed instead of traditional healing reconstruction. From
|
|
|
|
* an administrators perspective these are both resilver operations.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
int
|
2020-07-03 21:05:50 +03:00
|
|
|
spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot, int replacing,
|
|
|
|
int rebuild)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t txg, dtl_max_txg;
|
2020-07-03 21:05:50 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_t *oldvd, *newvd, *newrootvd, *pvd, *tvd;
|
|
|
|
vdev_ops_t *pvops;
|
2008-12-03 23:09:06 +03:00
|
|
|
char *oldvdpath, *newvdpath;
|
|
|
|
int newvd_isspare;
|
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
ASSERT(spa_writeable(spa));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
txg = spa_vdev_enter(spa);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
oldvd = spa_lookup_by_guid(spa, guid, B_FALSE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
if (spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT)) {
|
|
|
|
error = (spa_has_checkpoint(spa)) ?
|
|
|
|
ZFS_ERR_CHECKPOINT_EXISTS : ZFS_ERR_DISCARDING_CHECKPOINT;
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, error));
|
|
|
|
}
|
|
|
|
|
2020-07-03 21:05:50 +03:00
|
|
|
if (rebuild) {
|
|
|
|
if (!spa_feature_is_enabled(spa, SPA_FEATURE_DEVICE_REBUILD))
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
|
|
|
|
|
|
|
|
if (dsl_scan_resilvering(spa_get_dsl(spa)))
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg,
|
|
|
|
ZFS_ERR_RESILVER_IN_PROGRESS));
|
|
|
|
} else {
|
|
|
|
if (vdev_rebuild_active(rvd))
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg,
|
|
|
|
ZFS_ERR_REBUILD_IN_PROGRESS));
|
|
|
|
}
|
|
|
|
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
if (spa->spa_vdev_removal != NULL)
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
return (spa_vdev_exit(spa, NULL, txg, EBUSY));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (oldvd == NULL)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, ENODEV));
|
|
|
|
|
|
|
|
if (!oldvd->vdev_ops->vdev_op_leaf)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
|
|
|
|
|
|
|
|
pvd = oldvd->vdev_parent;
|
|
|
|
|
|
|
|
if ((error = spa_config_parse(spa, &newrootvd, nvroot, NULL, 0,
|
2012-04-08 21:23:08 +04:00
|
|
|
VDEV_ALLOC_ATTACH)) != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (spa_vdev_exit(spa, NULL, txg, EINVAL));
|
|
|
|
|
|
|
|
if (newrootvd->vdev_children != 1)
|
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, EINVAL));
|
|
|
|
|
|
|
|
newvd = newrootvd->vdev_child[0];
|
|
|
|
|
|
|
|
if (!newvd->vdev_ops->vdev_op_leaf)
|
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, EINVAL));
|
|
|
|
|
|
|
|
if ((error = vdev_create(newrootvd, txg, replacing)) != 0)
|
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, error));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Spares can't replace logs
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (oldvd->vdev_top->vdev_islog && newvd->vdev_isspare)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
|
|
|
|
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
/*
|
|
|
|
* A dRAID spare can only replace a child of its parent dRAID vdev.
|
|
|
|
*/
|
|
|
|
if (newvd->vdev_ops == &vdev_draid_spare_ops &&
|
|
|
|
oldvd->vdev_top != vdev_draid_spare_get_parent(newvd)) {
|
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
|
|
|
|
}
|
|
|
|
|
2020-07-03 21:05:50 +03:00
|
|
|
if (rebuild) {
|
|
|
|
/*
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
* For rebuilds, the top vdev must support reconstruction
|
2020-07-03 21:05:50 +03:00
|
|
|
* using only space maps. This means the only allowable
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
* vdevs types are the root vdev, a mirror, or dRAID.
|
2020-07-03 21:05:50 +03:00
|
|
|
*/
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
tvd = pvd;
|
|
|
|
if (pvd->vdev_top != NULL)
|
|
|
|
tvd = pvd->vdev_top;
|
|
|
|
|
|
|
|
if (tvd->vdev_ops != &vdev_mirror_ops &&
|
|
|
|
tvd->vdev_ops != &vdev_root_ops &&
|
|
|
|
tvd->vdev_ops != &vdev_draid_ops) {
|
2020-07-03 21:05:50 +03:00
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (!replacing) {
|
|
|
|
/*
|
|
|
|
* For attach, the only allowable parent is a mirror or the root
|
|
|
|
* vdev.
|
|
|
|
*/
|
|
|
|
if (pvd->vdev_ops != &vdev_mirror_ops &&
|
|
|
|
pvd->vdev_ops != &vdev_root_ops)
|
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
|
|
|
|
|
|
|
|
pvops = &vdev_mirror_ops;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Active hot spares can only be replaced by inactive hot
|
|
|
|
* spares.
|
|
|
|
*/
|
|
|
|
if (pvd->vdev_ops == &vdev_spare_ops &&
|
2010-08-27 01:24:34 +04:00
|
|
|
oldvd->vdev_isspare &&
|
2008-11-20 23:01:55 +03:00
|
|
|
!spa_has_spare(spa, newvd->vdev_guid))
|
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the source is a hot spare, and the parent isn't already a
|
|
|
|
* spare, then we want to create a new hot spare. Otherwise, we
|
|
|
|
* want to create a replacing vdev. The user is not allowed to
|
|
|
|
* attach to a spared vdev child unless the 'isspare' state is
|
|
|
|
* the same (spare replaces spare, non-spare replaces
|
|
|
|
* non-spare).
|
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
if (pvd->vdev_ops == &vdev_replacing_ops &&
|
|
|
|
spa_version(spa) < SPA_VERSION_MULTI_REPLACE) {
|
2008-11-20 23:01:55 +03:00
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
|
2010-08-27 01:24:34 +04:00
|
|
|
} else if (pvd->vdev_ops == &vdev_spare_ops &&
|
|
|
|
newvd->vdev_isspare != oldvd->vdev_isspare) {
|
2008-11-20 23:01:55 +03:00
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
|
2010-08-27 01:24:34 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (newvd->vdev_isspare)
|
2008-11-20 23:01:55 +03:00
|
|
|
pvops = &vdev_spare_ops;
|
|
|
|
else
|
|
|
|
pvops = &vdev_replacing_ops;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2009-07-03 02:44:48 +04:00
|
|
|
* Make sure the new device is big enough.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-07-03 02:44:48 +04:00
|
|
|
if (newvd->vdev_asize < vdev_get_min_asize(oldvd))
|
2008-11-20 23:01:55 +03:00
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, EOVERFLOW));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The new device cannot have a higher alignment requirement
|
|
|
|
* than the top-level vdev.
|
|
|
|
*/
|
|
|
|
if (newvd->vdev_ashift > oldvd->vdev_top->vdev_ashift)
|
2020-07-03 21:05:50 +03:00
|
|
|
return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this is an in-place replacement, update oldvd's path and devid
|
|
|
|
* to make it distinguishable from newvd, and unopenable from now on.
|
|
|
|
*/
|
|
|
|
if (strcmp(oldvd->vdev_path, newvd->vdev_path) == 0) {
|
|
|
|
spa_strfree(oldvd->vdev_path);
|
|
|
|
oldvd->vdev_path = kmem_alloc(strlen(newvd->vdev_path) + 5,
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP);
|
2020-06-07 21:42:12 +03:00
|
|
|
(void) snprintf(oldvd->vdev_path, strlen(newvd->vdev_path) + 5,
|
|
|
|
"%s/%s", newvd->vdev_path, "old");
|
2008-11-20 23:01:55 +03:00
|
|
|
if (oldvd->vdev_devid != NULL) {
|
|
|
|
spa_strfree(oldvd->vdev_devid);
|
|
|
|
oldvd->vdev_devid = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the parent is not a mirror, or if we're replacing, insert the new
|
|
|
|
* mirror/replacing/spare vdev above oldvd.
|
|
|
|
*/
|
|
|
|
if (pvd->vdev_ops != pvops)
|
|
|
|
pvd = vdev_add_parent(oldvd, pvops);
|
|
|
|
|
|
|
|
ASSERT(pvd->vdev_top->vdev_parent == rvd);
|
|
|
|
ASSERT(pvd->vdev_ops == pvops);
|
|
|
|
ASSERT(oldvd->vdev_parent == pvd);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Extract the new device from its root and add it to pvd.
|
|
|
|
*/
|
|
|
|
vdev_remove_child(newrootvd, newvd);
|
|
|
|
newvd->vdev_id = pvd->vdev_children;
|
2010-05-29 00:45:14 +04:00
|
|
|
newvd->vdev_crtxg = oldvd->vdev_crtxg;
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_add_child(pvd, newvd);
|
|
|
|
|
2017-02-23 21:32:15 +03:00
|
|
|
/*
|
|
|
|
* Reevaluate the parent vdev state.
|
|
|
|
*/
|
|
|
|
vdev_propagate_state(pvd);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
tvd = newvd->vdev_top;
|
|
|
|
ASSERT(pvd->vdev_top == tvd);
|
|
|
|
ASSERT(tvd->vdev_parent == rvd);
|
|
|
|
|
|
|
|
vdev_config_dirty(tvd);
|
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Set newvd's DTL to [TXG_INITIAL, dtl_max_txg) so that we account
|
|
|
|
* for any dmu_sync-ed blocks. It will propagate upward when
|
|
|
|
* spa_vdev_exit() calls vdev_dtl_reassess().
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
dtl_max_txg = txg + TXG_CONCURRENT_STATES;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-07-03 21:05:50 +03:00
|
|
|
vdev_dtl_dirty(newvd, DTL_MISSING,
|
|
|
|
TXG_INITIAL, dtl_max_txg - TXG_INITIAL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (newvd->vdev_isspare) {
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_spare_activate(newvd);
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_SPARE);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
oldvdpath = spa_strdup(oldvd->vdev_path);
|
|
|
|
newvdpath = spa_strdup(newvd->vdev_path);
|
|
|
|
newvd_isspare = newvd->vdev_isspare;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Mark newvd's DTL dirty in this txg.
|
|
|
|
*/
|
|
|
|
vdev_dirty(tvd, VDD_DTL, newvd, txg);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2020-07-03 21:05:50 +03:00
|
|
|
* Schedule the resilver or rebuild to restart in the future. We do
|
|
|
|
* this to ensure that dmu_sync-ed blocks have been stitched into the
|
|
|
|
* respective datasets.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
2020-07-03 21:05:50 +03:00
|
|
|
if (rebuild) {
|
|
|
|
newvd->vdev_rebuild_txg = txg;
|
|
|
|
|
|
|
|
vdev_rebuild(tvd);
|
|
|
|
} else {
|
|
|
|
newvd->vdev_resilver_txg = txg;
|
|
|
|
|
|
|
|
if (dsl_scan_resilvering(spa_get_dsl(spa)) &&
|
|
|
|
spa_feature_is_enabled(spa, SPA_FEATURE_RESILVER_DEFER)) {
|
|
|
|
vdev_defer_resilver(newvd);
|
|
|
|
} else {
|
|
|
|
dsl_scan_restart_resilver(spa->spa_dsl_pool,
|
|
|
|
dtl_max_txg);
|
|
|
|
}
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-07-28 01:29:15 +03:00
|
|
|
if (spa->spa_bootfs)
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(spa, newvd, NULL, ESC_ZFS_BOOTFS_VDEV_ATTACH);
|
2016-07-28 01:29:15 +03:00
|
|
|
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_ATTACH);
|
2016-07-28 01:29:15 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Commit the config
|
|
|
|
*/
|
|
|
|
(void) spa_vdev_exit(spa, newrootvd, dtl_max_txg, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal(spa, "vdev attach", NULL,
|
2010-05-29 00:45:14 +04:00
|
|
|
"%s vdev=%s %s vdev=%s",
|
2009-08-18 22:43:27 +04:00
|
|
|
replacing && newvd_isspare ? "spare in" :
|
|
|
|
replacing ? "replace" : "attach", newvdpath,
|
|
|
|
replacing ? "for" : "to", oldvdpath);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
spa_strfree(oldvdpath);
|
|
|
|
spa_strfree(newvdpath);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Detach a device from a mirror or replacing vdev.
|
2013-06-11 21:12:34 +04:00
|
|
|
*
|
2008-11-20 23:01:55 +03:00
|
|
|
* If 'replace_done' is specified, only detach if the parent
|
|
|
|
* is a replacing vdev.
|
|
|
|
*/
|
|
|
|
int
|
2009-01-16 00:59:39 +03:00
|
|
|
spa_vdev_detach(spa_t *spa, uint64_t guid, uint64_t pguid, int replace_done)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
uint64_t txg;
|
2009-01-16 00:59:39 +03:00
|
|
|
int error;
|
2019-12-05 23:37:00 +03:00
|
|
|
vdev_t *rvd __maybe_unused = spa->spa_root_vdev;
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_t *vd, *pvd, *cvd, *tvd;
|
|
|
|
boolean_t unspare = B_FALSE;
|
2010-08-26 20:58:04 +04:00
|
|
|
uint64_t unspare_guid = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
char *vdpath;
|
2017-11-04 23:25:13 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
ASSERT(spa_writeable(spa));
|
|
|
|
|
2020-07-03 21:05:50 +03:00
|
|
|
txg = spa_vdev_detach_enter(spa, guid);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
vd = spa_lookup_by_guid(spa, guid, B_FALSE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
/*
|
|
|
|
* Besides being called directly from the userland through the
|
|
|
|
* ioctl interface, spa_vdev_detach() can be potentially called
|
|
|
|
* at the end of spa_vdev_resilver_done().
|
|
|
|
*
|
|
|
|
* In the regular case, when we have a checkpoint this shouldn't
|
|
|
|
* happen as we never empty the DTLs of a vdev during the scrub
|
|
|
|
* [see comment in dsl_scan_done()]. Thus spa_vdev_resilvering_done()
|
|
|
|
* should never get here when we have a checkpoint.
|
|
|
|
*
|
|
|
|
* That said, even in a case when we checkpoint the pool exactly
|
|
|
|
* as spa_vdev_resilver_done() calls this function everything
|
|
|
|
* should be fine as the resilver will return right away.
|
|
|
|
*/
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
if (spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT)) {
|
|
|
|
error = (spa_has_checkpoint(spa)) ?
|
|
|
|
ZFS_ERR_CHECKPOINT_EXISTS : ZFS_ERR_DISCARDING_CHECKPOINT;
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, error));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (vd == NULL)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, ENODEV));
|
|
|
|
|
|
|
|
if (!vd->vdev_ops->vdev_op_leaf)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
|
|
|
|
|
|
|
|
pvd = vd->vdev_parent;
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
/*
|
|
|
|
* If the parent/child relationship is not as expected, don't do it.
|
|
|
|
* Consider M(A,R(B,C)) -- that is, a mirror of A with a replacing
|
|
|
|
* vdev that's replacing B with C. The user's intent in replacing
|
|
|
|
* is to go from M(A,B) to M(A,C). If the user decides to cancel
|
|
|
|
* the replace by detaching C, the expected behavior is to end up
|
|
|
|
* M(A,B). But suppose that right after deciding to detach C,
|
|
|
|
* the replacement of B completes. We would have M(A,C), and then
|
|
|
|
* ask to detach C, which would leave us with just A -- not what
|
|
|
|
* the user wanted. To prevent this, we make sure that the
|
|
|
|
* parent/child relationship hasn't changed -- in this example,
|
|
|
|
* that C's parent is still the replacing vdev R.
|
|
|
|
*/
|
|
|
|
if (pvd->vdev_guid != pguid && pguid != 0)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, EBUSY));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-08-27 01:24:34 +04:00
|
|
|
* Only 'replacing' or 'spare' vdevs can be replaced.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
if (replace_done && pvd->vdev_ops != &vdev_replacing_ops &&
|
|
|
|
pvd->vdev_ops != &vdev_spare_ops)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(pvd->vdev_ops != &vdev_spare_ops ||
|
|
|
|
spa_version(spa) >= SPA_VERSION_SPARES);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Only mirror, replacing, and spare vdevs support detach.
|
|
|
|
*/
|
|
|
|
if (pvd->vdev_ops != &vdev_replacing_ops &&
|
|
|
|
pvd->vdev_ops != &vdev_mirror_ops &&
|
|
|
|
pvd->vdev_ops != &vdev_spare_ops)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
|
|
|
|
|
|
|
|
/*
|
2009-01-16 00:59:39 +03:00
|
|
|
* If this device has the only valid copy of some data,
|
|
|
|
* we cannot safely detach it.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-01-16 00:59:39 +03:00
|
|
|
if (vdev_dtl_required(vd))
|
2008-11-20 23:01:55 +03:00
|
|
|
return (spa_vdev_exit(spa, NULL, txg, EBUSY));
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
ASSERT(pvd->vdev_children >= 2);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* If we are detaching the second disk from a replacing vdev, then
|
|
|
|
* check to see if we changed the original vdev's path to have "/old"
|
|
|
|
* at the end in spa_vdev_attach(). If so, undo that change now.
|
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
if (pvd->vdev_ops == &vdev_replacing_ops && vd->vdev_id > 0 &&
|
|
|
|
vd->vdev_path != NULL) {
|
|
|
|
size_t len = strlen(vd->vdev_path);
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < pvd->vdev_children; c++) {
|
2010-08-27 01:24:34 +04:00
|
|
|
cvd = pvd->vdev_child[c];
|
|
|
|
|
|
|
|
if (cvd == vd || cvd->vdev_path == NULL)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (strncmp(cvd->vdev_path, vd->vdev_path, len) == 0 &&
|
|
|
|
strcmp(cvd->vdev_path + len, "/old") == 0) {
|
|
|
|
spa_strfree(cvd->vdev_path);
|
|
|
|
cvd->vdev_path = spa_strdup(vd->vdev_path);
|
|
|
|
break;
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
* If we are detaching the original disk from a normal spare, then it
|
|
|
|
* implies that the spare should become a real disk, and be removed
|
|
|
|
* from the active spare list for the pool. dRAID spares on the
|
|
|
|
* other hand are coupled to the pool and thus should never be removed
|
|
|
|
* from the spares list.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (pvd->vdev_ops == &vdev_spare_ops && vd->vdev_id == 0) {
|
|
|
|
vdev_t *last_cvd = pvd->vdev_child[pvd->vdev_children - 1];
|
|
|
|
|
|
|
|
if (last_cvd->vdev_isspare &&
|
|
|
|
last_cvd->vdev_ops != &vdev_draid_spare_ops) {
|
|
|
|
unspare = B_TRUE;
|
|
|
|
}
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Erase the disk labels so the disk can be used for other things.
|
|
|
|
* This must be done after all other error cases are handled,
|
|
|
|
* but before we disembowel vd (so we can still do I/O to it).
|
|
|
|
* But if we can't do it, don't treat the error as fatal --
|
|
|
|
* it may be that the unwritability of the disk is the reason
|
|
|
|
* it's being detached!
|
|
|
|
*/
|
|
|
|
error = vdev_label_init(vd, 0, VDEV_LABEL_REMOVE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove vd from its parent and compact the parent's children.
|
|
|
|
*/
|
|
|
|
vdev_remove_child(pvd, vd);
|
|
|
|
vdev_compact_children(pvd);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remember one of the remaining children so we can get tvd below.
|
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
cvd = pvd->vdev_child[pvd->vdev_children - 1];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we need to remove the remaining child from the list of hot spares,
|
2009-01-16 00:59:39 +03:00
|
|
|
* do it now, marking the vdev as no longer a spare in the process.
|
|
|
|
* We must do this before vdev_remove_parent(), because that can
|
|
|
|
* change the GUID if it creates a new toplevel GUID. For a similar
|
|
|
|
* reason, we must remove the spare now, in the same txg as the detach;
|
|
|
|
* otherwise someone could attach a new sibling, change the GUID, and
|
|
|
|
* the subsequent attempt to spa_vdev_remove(unspare_guid) would fail.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
if (unspare) {
|
|
|
|
ASSERT(cvd->vdev_isspare);
|
|
|
|
spa_spare_remove(cvd);
|
|
|
|
unspare_guid = cvd->vdev_guid;
|
2009-01-16 00:59:39 +03:00
|
|
|
(void) spa_vdev_remove(spa, unspare_guid, B_TRUE);
|
2010-08-27 01:24:34 +04:00
|
|
|
cvd->vdev_unspare = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* If the parent mirror/replacing vdev only has one child,
|
|
|
|
* the parent is no longer needed. Remove it from the tree.
|
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
if (pvd->vdev_children == 1) {
|
|
|
|
if (pvd->vdev_ops == &vdev_spare_ops)
|
|
|
|
cvd->vdev_unspare = B_FALSE;
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_remove_parent(cvd);
|
2010-08-27 01:24:34 +04:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* We don't set tvd until now because the parent we just removed
|
|
|
|
* may have been the previous top-level vdev.
|
|
|
|
*/
|
|
|
|
tvd = cvd->vdev_top;
|
|
|
|
ASSERT(tvd->vdev_parent == rvd);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reevaluate the parent vdev state.
|
|
|
|
*/
|
|
|
|
vdev_propagate_state(cvd);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the 'autoexpand' property is set on the pool then automatically
|
|
|
|
* try to expand the size of the pool. For example if the device we
|
|
|
|
* just detached was smaller than the others, it may be possible to
|
|
|
|
* add metaslabs (i.e. grow the pool). We need to reopen the vdev
|
|
|
|
* first so that we can obtain the updated sizes of the leaf vdevs.
|
|
|
|
*/
|
|
|
|
if (spa->spa_autoexpand) {
|
|
|
|
vdev_reopen(tvd);
|
|
|
|
vdev_expand(tvd, txg);
|
|
|
|
}
|
|
|
|
|
|
|
|
vdev_config_dirty(tvd);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Mark vd's DTL as dirty in this txg. vdev_dtl_sync() will see that
|
|
|
|
* vd->vdev_detached is set and free vd's DTL object in syncing context.
|
|
|
|
* But first make sure we're not on any *other* txg's DTL list, to
|
|
|
|
* prevent vd from being accessed after it's freed.
|
|
|
|
*/
|
2016-10-14 21:00:47 +03:00
|
|
|
vdpath = spa_strdup(vd->vdev_path ? vd->vdev_path : "none");
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int t = 0; t < TXG_SIZE; t++)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) txg_list_remove_this(&tvd->vdev_dtl_list, vd, t);
|
|
|
|
vd->vdev_detached = B_TRUE;
|
|
|
|
vdev_dirty(tvd, VDD_DTL, vd, txg);
|
|
|
|
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE);
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
spa_notify_waiters(spa);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/* hang on to the spa before we release the lock */
|
|
|
|
spa_open_ref(spa, FTAG);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_vdev_exit(spa, vd, txg, 0);
|
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal(spa, "detach", NULL,
|
2010-05-29 00:45:14 +04:00
|
|
|
"vdev=%s", vdpath);
|
|
|
|
spa_strfree(vdpath);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this was the removal of the original device in a hot spare vdev,
|
|
|
|
* then we want to go through and remove the device from the hot spare
|
|
|
|
* list of every other pool.
|
|
|
|
*/
|
|
|
|
if (unspare) {
|
2010-08-27 01:24:34 +04:00
|
|
|
spa_t *altspa = NULL;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&spa_namespace_lock);
|
2010-08-27 01:24:34 +04:00
|
|
|
while ((altspa = spa_next(altspa)) != NULL) {
|
|
|
|
if (altspa->spa_state != POOL_STATE_ACTIVE ||
|
|
|
|
altspa == spa)
|
2010-05-29 00:45:14 +04:00
|
|
|
continue;
|
2010-08-27 01:24:34 +04:00
|
|
|
|
|
|
|
spa_open_ref(altspa, FTAG);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
2010-08-27 01:24:34 +04:00
|
|
|
(void) spa_vdev_remove(altspa, unspare_guid, B_TRUE);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&spa_namespace_lock);
|
2010-08-27 01:24:34 +04:00
|
|
|
spa_close(altspa, FTAG);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
2010-08-27 01:24:34 +04:00
|
|
|
|
|
|
|
/* search the rest of the vdevs for spares to remove */
|
|
|
|
spa_vdev_resilver_done(spa);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/* all done with the spa; OK to release */
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2018-12-19 19:20:39 +03:00
|
|
|
static int
|
|
|
|
spa_vdev_initialize_impl(spa_t *spa, uint64_t guid, uint64_t cmd_type,
|
|
|
|
list_t *vd_list)
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
{
|
2018-12-19 19:20:39 +03:00
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);
|
|
|
|
|
|
|
|
/* Look up vdev and ensure it's a leaf. */
|
|
|
|
vdev_t *vd = spa_lookup_by_guid(spa, guid, B_FALSE);
|
|
|
|
if (vd == NULL || vd->vdev_detached) {
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
return (SET_ERROR(ENODEV));
|
|
|
|
} else if (!vd->vdev_ops->vdev_op_leaf || !vdev_is_concrete(vd)) {
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
return (SET_ERROR(EINVAL));
|
|
|
|
} else if (!vdev_writeable(vd)) {
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
return (SET_ERROR(EROFS));
|
|
|
|
}
|
|
|
|
mutex_enter(&vd->vdev_initialize_lock);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When we activate an initialize action we check to see
|
|
|
|
* if the vdev_initialize_thread is NULL. We do this instead
|
|
|
|
* of using the vdev_initialize_state since there might be
|
|
|
|
* a previous initialization process which has completed but
|
|
|
|
* the thread is not exited.
|
|
|
|
*/
|
2019-03-29 19:13:20 +03:00
|
|
|
if (cmd_type == POOL_INITIALIZE_START &&
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
(vd->vdev_initialize_thread != NULL ||
|
|
|
|
vd->vdev_top->vdev_removing)) {
|
|
|
|
mutex_exit(&vd->vdev_initialize_lock);
|
|
|
|
return (SET_ERROR(EBUSY));
|
|
|
|
} else if (cmd_type == POOL_INITIALIZE_CANCEL &&
|
|
|
|
(vd->vdev_initialize_state != VDEV_INITIALIZE_ACTIVE &&
|
|
|
|
vd->vdev_initialize_state != VDEV_INITIALIZE_SUSPENDED)) {
|
|
|
|
mutex_exit(&vd->vdev_initialize_lock);
|
|
|
|
return (SET_ERROR(ESRCH));
|
|
|
|
} else if (cmd_type == POOL_INITIALIZE_SUSPEND &&
|
|
|
|
vd->vdev_initialize_state != VDEV_INITIALIZE_ACTIVE) {
|
|
|
|
mutex_exit(&vd->vdev_initialize_lock);
|
|
|
|
return (SET_ERROR(ESRCH));
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (cmd_type) {
|
2019-03-29 19:13:20 +03:00
|
|
|
case POOL_INITIALIZE_START:
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
vdev_initialize(vd);
|
|
|
|
break;
|
|
|
|
case POOL_INITIALIZE_CANCEL:
|
2018-12-19 19:20:39 +03:00
|
|
|
vdev_initialize_stop(vd, VDEV_INITIALIZE_CANCELED, vd_list);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
break;
|
|
|
|
case POOL_INITIALIZE_SUSPEND:
|
2018-12-19 19:20:39 +03:00
|
|
|
vdev_initialize_stop(vd, VDEV_INITIALIZE_SUSPENDED, vd_list);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
panic("invalid cmd_type %llu", (unsigned long long)cmd_type);
|
|
|
|
}
|
|
|
|
mutex_exit(&vd->vdev_initialize_lock);
|
|
|
|
|
2018-12-19 19:20:39 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
spa_vdev_initialize(spa_t *spa, nvlist_t *nv, uint64_t cmd_type,
|
|
|
|
nvlist_t *vdev_errlist)
|
|
|
|
{
|
|
|
|
int total_errors = 0;
|
|
|
|
list_t vd_list;
|
|
|
|
|
|
|
|
list_create(&vd_list, sizeof (vdev_t),
|
|
|
|
offsetof(vdev_t, vdev_initialize_node));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We hold the namespace lock through the whole function
|
|
|
|
* to prevent any changes to the pool while we're starting or
|
|
|
|
* stopping initialization. The config and state locks are held so that
|
|
|
|
* we can properly assess the vdev state before we commit to
|
|
|
|
* the initializing operation.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
|
|
|
|
for (nvpair_t *pair = nvlist_next_nvpair(nv, NULL);
|
|
|
|
pair != NULL; pair = nvlist_next_nvpair(nv, pair)) {
|
|
|
|
uint64_t vdev_guid = fnvpair_value_uint64(pair);
|
|
|
|
|
|
|
|
int error = spa_vdev_initialize_impl(spa, vdev_guid, cmd_type,
|
|
|
|
&vd_list);
|
|
|
|
if (error != 0) {
|
|
|
|
char guid_as_str[MAXNAMELEN];
|
|
|
|
|
|
|
|
(void) snprintf(guid_as_str, sizeof (guid_as_str),
|
|
|
|
"%llu", (unsigned long long)vdev_guid);
|
|
|
|
fnvlist_add_int64(vdev_errlist, guid_as_str, error);
|
|
|
|
total_errors++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Wait for all initialize threads to stop. */
|
|
|
|
vdev_initialize_stop_wait(spa, &vd_list);
|
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
/* Sync out the initializing state */
|
|
|
|
txg_wait_synced(spa->spa_dsl_pool, 0);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
2018-12-19 19:20:39 +03:00
|
|
|
list_destroy(&vd_list);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
|
2018-12-19 19:20:39 +03:00
|
|
|
return (total_errors);
|
|
|
|
}
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
|
2019-03-29 19:13:20 +03:00
|
|
|
static int
|
|
|
|
spa_vdev_trim_impl(spa_t *spa, uint64_t guid, uint64_t cmd_type,
|
|
|
|
uint64_t rate, boolean_t partial, boolean_t secure, list_t *vd_list)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);
|
|
|
|
|
|
|
|
/* Look up vdev and ensure it's a leaf. */
|
|
|
|
vdev_t *vd = spa_lookup_by_guid(spa, guid, B_FALSE);
|
|
|
|
if (vd == NULL || vd->vdev_detached) {
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
return (SET_ERROR(ENODEV));
|
|
|
|
} else if (!vd->vdev_ops->vdev_op_leaf || !vdev_is_concrete(vd)) {
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
return (SET_ERROR(EINVAL));
|
|
|
|
} else if (!vdev_writeable(vd)) {
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
return (SET_ERROR(EROFS));
|
|
|
|
} else if (!vd->vdev_has_trim) {
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
return (SET_ERROR(EOPNOTSUPP));
|
|
|
|
} else if (secure && !vd->vdev_has_securetrim) {
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
return (SET_ERROR(EOPNOTSUPP));
|
|
|
|
}
|
|
|
|
mutex_enter(&vd->vdev_trim_lock);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When we activate a TRIM action we check to see if the
|
|
|
|
* vdev_trim_thread is NULL. We do this instead of using the
|
|
|
|
* vdev_trim_state since there might be a previous TRIM process
|
|
|
|
* which has completed but the thread is not exited.
|
|
|
|
*/
|
|
|
|
if (cmd_type == POOL_TRIM_START &&
|
|
|
|
(vd->vdev_trim_thread != NULL || vd->vdev_top->vdev_removing)) {
|
|
|
|
mutex_exit(&vd->vdev_trim_lock);
|
|
|
|
return (SET_ERROR(EBUSY));
|
|
|
|
} else if (cmd_type == POOL_TRIM_CANCEL &&
|
|
|
|
(vd->vdev_trim_state != VDEV_TRIM_ACTIVE &&
|
|
|
|
vd->vdev_trim_state != VDEV_TRIM_SUSPENDED)) {
|
|
|
|
mutex_exit(&vd->vdev_trim_lock);
|
|
|
|
return (SET_ERROR(ESRCH));
|
|
|
|
} else if (cmd_type == POOL_TRIM_SUSPEND &&
|
|
|
|
vd->vdev_trim_state != VDEV_TRIM_ACTIVE) {
|
|
|
|
mutex_exit(&vd->vdev_trim_lock);
|
|
|
|
return (SET_ERROR(ESRCH));
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (cmd_type) {
|
|
|
|
case POOL_TRIM_START:
|
|
|
|
vdev_trim(vd, rate, partial, secure);
|
|
|
|
break;
|
|
|
|
case POOL_TRIM_CANCEL:
|
|
|
|
vdev_trim_stop(vd, VDEV_TRIM_CANCELED, vd_list);
|
|
|
|
break;
|
|
|
|
case POOL_TRIM_SUSPEND:
|
|
|
|
vdev_trim_stop(vd, VDEV_TRIM_SUSPENDED, vd_list);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
panic("invalid cmd_type %llu", (unsigned long long)cmd_type);
|
|
|
|
}
|
|
|
|
mutex_exit(&vd->vdev_trim_lock);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Initiates a manual TRIM for the requested vdevs. This kicks off individual
|
|
|
|
* TRIM threads for each child vdev. These threads pass over all of the free
|
|
|
|
* space in the vdev's metaslabs and issues TRIM commands for that space.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
spa_vdev_trim(spa_t *spa, nvlist_t *nv, uint64_t cmd_type, uint64_t rate,
|
|
|
|
boolean_t partial, boolean_t secure, nvlist_t *vdev_errlist)
|
|
|
|
{
|
|
|
|
int total_errors = 0;
|
|
|
|
list_t vd_list;
|
|
|
|
|
|
|
|
list_create(&vd_list, sizeof (vdev_t),
|
|
|
|
offsetof(vdev_t, vdev_trim_node));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We hold the namespace lock through the whole function
|
|
|
|
* to prevent any changes to the pool while we're starting or
|
|
|
|
* stopping TRIM. The config and state locks are held so that
|
|
|
|
* we can properly assess the vdev state before we commit to
|
|
|
|
* the TRIM operation.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
|
|
|
|
for (nvpair_t *pair = nvlist_next_nvpair(nv, NULL);
|
|
|
|
pair != NULL; pair = nvlist_next_nvpair(nv, pair)) {
|
|
|
|
uint64_t vdev_guid = fnvpair_value_uint64(pair);
|
|
|
|
|
|
|
|
int error = spa_vdev_trim_impl(spa, vdev_guid, cmd_type,
|
|
|
|
rate, partial, secure, &vd_list);
|
|
|
|
if (error != 0) {
|
|
|
|
char guid_as_str[MAXNAMELEN];
|
|
|
|
|
|
|
|
(void) snprintf(guid_as_str, sizeof (guid_as_str),
|
|
|
|
"%llu", (unsigned long long)vdev_guid);
|
|
|
|
fnvlist_add_int64(vdev_errlist, guid_as_str, error);
|
|
|
|
total_errors++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Wait for all TRIM threads to stop. */
|
|
|
|
vdev_trim_stop_wait(spa, &vd_list);
|
|
|
|
|
|
|
|
/* Sync out the TRIM state */
|
|
|
|
txg_wait_synced(spa->spa_dsl_pool, 0);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
list_destroy(&vd_list);
|
|
|
|
|
|
|
|
return (total_errors);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Split a set of devices from their mirrors, and create a new pool from them.
|
|
|
|
*/
|
|
|
|
int
|
2022-04-19 21:38:30 +03:00
|
|
|
spa_vdev_split_mirror(spa_t *spa, const char *newname, nvlist_t *config,
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *props, boolean_t exp)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
uint64_t txg, *glist;
|
|
|
|
spa_t *newspa;
|
|
|
|
uint_t c, children, lastlog;
|
|
|
|
nvlist_t **child, *nvl, *tmp;
|
|
|
|
dmu_tx_t *tx;
|
|
|
|
char *altroot = NULL;
|
|
|
|
vdev_t *rvd, **vml = NULL; /* vdev modify list */
|
|
|
|
boolean_t activate_slog;
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
ASSERT(spa_writeable(spa));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
txg = spa_vdev_enter(spa);
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
if (spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT)) {
|
|
|
|
error = (spa_has_checkpoint(spa)) ?
|
|
|
|
ZFS_ERR_CHECKPOINT_EXISTS : ZFS_ERR_DISCARDING_CHECKPOINT;
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, error));
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* clear the log and flush everything up to now */
|
|
|
|
activate_slog = spa_passivate_log(spa);
|
|
|
|
(void) spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
error = spa_reset_logs(spa);
|
2010-05-29 00:45:14 +04:00
|
|
|
txg = spa_vdev_config_enter(spa);
|
|
|
|
|
|
|
|
if (activate_slog)
|
|
|
|
spa_activate_log(spa);
|
|
|
|
|
|
|
|
if (error != 0)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, error));
|
|
|
|
|
|
|
|
/* check new spa name before going any further */
|
|
|
|
if (spa_lookup(newname) != NULL)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, EEXIST));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* scan through all the children to ensure they're all mirrors
|
|
|
|
*/
|
|
|
|
if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvl) != 0 ||
|
|
|
|
nvlist_lookup_nvlist_array(nvl, ZPOOL_CONFIG_CHILDREN, &child,
|
|
|
|
&children) != 0)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, EINVAL));
|
|
|
|
|
|
|
|
/* first, check to ensure we've got the right child count */
|
|
|
|
rvd = spa->spa_root_vdev;
|
|
|
|
lastlog = 0;
|
|
|
|
for (c = 0; c < rvd->vdev_children; c++) {
|
|
|
|
vdev_t *vd = rvd->vdev_child[c];
|
|
|
|
|
|
|
|
/* don't count the holes & logs as children */
|
2020-05-06 20:32:28 +03:00
|
|
|
if (vd->vdev_islog || (vd->vdev_ops != &vdev_indirect_ops &&
|
|
|
|
!vdev_is_concrete(vd))) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (lastlog == 0)
|
|
|
|
lastlog = c;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
lastlog = 0;
|
|
|
|
}
|
|
|
|
if (children != (lastlog != 0 ? lastlog : rvd->vdev_children))
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, EINVAL));
|
|
|
|
|
|
|
|
/* next, ensure no spare or cache devices are part of the split */
|
|
|
|
if (nvlist_lookup_nvlist(nvl, ZPOOL_CONFIG_SPARES, &tmp) == 0 ||
|
|
|
|
nvlist_lookup_nvlist(nvl, ZPOOL_CONFIG_L2CACHE, &tmp) == 0)
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, EINVAL));
|
|
|
|
|
2014-11-21 03:09:39 +03:00
|
|
|
vml = kmem_zalloc(children * sizeof (vdev_t *), KM_SLEEP);
|
|
|
|
glist = kmem_zalloc(children * sizeof (uint64_t), KM_SLEEP);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/* then, loop over each vdev and validate it */
|
|
|
|
for (c = 0; c < children; c++) {
|
|
|
|
uint64_t is_hole = 0;
|
|
|
|
|
|
|
|
(void) nvlist_lookup_uint64(child[c], ZPOOL_CONFIG_IS_HOLE,
|
|
|
|
&is_hole);
|
|
|
|
|
|
|
|
if (is_hole != 0) {
|
|
|
|
if (spa->spa_root_vdev->vdev_child[c]->vdev_ishole ||
|
|
|
|
spa->spa_root_vdev->vdev_child[c]->vdev_islog) {
|
|
|
|
continue;
|
|
|
|
} else {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2010-05-29 00:45:14 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-05-06 20:32:28 +03:00
|
|
|
/* deal with indirect vdevs */
|
|
|
|
if (spa->spa_root_vdev->vdev_child[c]->vdev_ops ==
|
|
|
|
&vdev_indirect_ops)
|
|
|
|
continue;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* which disk is going to be split? */
|
|
|
|
if (nvlist_lookup_uint64(child[c], ZPOOL_CONFIG_GUID,
|
|
|
|
&glist[c]) != 0) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2010-05-29 00:45:14 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* look it up in the spa */
|
|
|
|
vml[c] = spa_lookup_by_guid(spa, glist[c], B_FALSE);
|
|
|
|
if (vml[c] == NULL) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(ENODEV);
|
2010-05-29 00:45:14 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* make sure there's nothing stopping the split */
|
|
|
|
if (vml[c]->vdev_parent->vdev_ops != &vdev_mirror_ops ||
|
|
|
|
vml[c]->vdev_islog ||
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
!vdev_is_concrete(vml[c]) ||
|
2010-05-29 00:45:14 +04:00
|
|
|
vml[c]->vdev_isspare ||
|
|
|
|
vml[c]->vdev_isl2cache ||
|
|
|
|
!vdev_writeable(vml[c]) ||
|
|
|
|
vml[c]->vdev_children != 0 ||
|
|
|
|
vml[c]->vdev_state != VDEV_STATE_HEALTHY ||
|
|
|
|
c != spa->spa_root_vdev->vdev_child[c]->vdev_id) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EINVAL);
|
2010-05-29 00:45:14 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2018-09-13 04:14:42 +03:00
|
|
|
if (vdev_dtl_required(vml[c]) ||
|
|
|
|
vdev_resilver_needed(vml[c], NULL, NULL)) {
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(EBUSY);
|
2010-05-29 00:45:14 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* we need certain info from the top level */
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_uint64(child[c], ZPOOL_CONFIG_METASLAB_ARRAY,
|
|
|
|
vml[c]->vdev_top->vdev_ms_array);
|
|
|
|
fnvlist_add_uint64(child[c], ZPOOL_CONFIG_METASLAB_SHIFT,
|
|
|
|
vml[c]->vdev_top->vdev_ms_shift);
|
|
|
|
fnvlist_add_uint64(child[c], ZPOOL_CONFIG_ASIZE,
|
|
|
|
vml[c]->vdev_top->vdev_asize);
|
|
|
|
fnvlist_add_uint64(child[c], ZPOOL_CONFIG_ASHIFT,
|
|
|
|
vml[c]->vdev_top->vdev_ashift);
|
2016-04-11 23:16:57 +03:00
|
|
|
|
|
|
|
/* transfer per-vdev ZAPs */
|
|
|
|
ASSERT3U(vml[c]->vdev_leaf_zap, !=, 0);
|
|
|
|
VERIFY0(nvlist_add_uint64(child[c],
|
|
|
|
ZPOOL_CONFIG_VDEV_LEAF_ZAP, vml[c]->vdev_leaf_zap));
|
|
|
|
|
|
|
|
ASSERT3U(vml[c]->vdev_top->vdev_top_zap, !=, 0);
|
|
|
|
VERIFY0(nvlist_add_uint64(child[c],
|
|
|
|
ZPOOL_CONFIG_VDEV_TOP_ZAP,
|
|
|
|
vml[c]->vdev_parent->vdev_top_zap));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (error != 0) {
|
|
|
|
kmem_free(vml, children * sizeof (vdev_t *));
|
|
|
|
kmem_free(glist, children * sizeof (uint64_t));
|
|
|
|
return (spa_vdev_exit(spa, NULL, txg, error));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* stop writers from using the disks */
|
|
|
|
for (c = 0; c < children; c++) {
|
|
|
|
if (vml[c] != NULL)
|
|
|
|
vml[c]->vdev_offline = B_TRUE;
|
|
|
|
}
|
|
|
|
vdev_reopen(spa->spa_root_vdev);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Temporarily record the splitting vdevs in the spa config. This
|
|
|
|
* will disappear once the config is regenerated.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2021-10-27 02:15:38 +03:00
|
|
|
nvl = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64_array(nvl, ZPOOL_CONFIG_SPLIT_LIST, glist, children);
|
2010-05-29 00:45:14 +04:00
|
|
|
kmem_free(glist, children * sizeof (uint64_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&spa->spa_props_lock);
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_nvlist(spa->spa_config, ZPOOL_CONFIG_SPLIT, nvl);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&spa->spa_props_lock);
|
|
|
|
spa->spa_config_splitting = nvl;
|
|
|
|
vdev_config_dirty(spa->spa_root_vdev);
|
|
|
|
|
|
|
|
/* configure and create the new pool */
|
2021-10-27 02:15:38 +03:00
|
|
|
fnvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME, newname);
|
|
|
|
fnvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE,
|
|
|
|
exp ? POOL_STATE_EXPORTED : POOL_STATE_ACTIVE);
|
|
|
|
fnvlist_add_uint64(config, ZPOOL_CONFIG_VERSION, spa_version(spa));
|
|
|
|
fnvlist_add_uint64(config, ZPOOL_CONFIG_POOL_TXG, spa->spa_config_txg);
|
|
|
|
fnvlist_add_uint64(config, ZPOOL_CONFIG_POOL_GUID,
|
|
|
|
spa_generate_guid(NULL));
|
2016-04-11 23:16:57 +03:00
|
|
|
VERIFY0(nvlist_add_boolean(config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) nvlist_lookup_string(props,
|
|
|
|
zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* add the new pool to the namespace */
|
|
|
|
newspa = spa_add(newname, config, altroot);
|
2016-04-11 23:16:57 +03:00
|
|
|
newspa->spa_avz_action = AVZ_ACTION_REBUILD;
|
2010-05-29 00:45:14 +04:00
|
|
|
newspa->spa_config_txg = spa->spa_config_txg;
|
|
|
|
spa_set_log_state(newspa, SPA_LOG_CLEAR);
|
|
|
|
|
|
|
|
/* release the spa config lock, retaining the namespace lock */
|
|
|
|
spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
|
|
|
|
|
|
|
|
if (zio_injection_enabled)
|
|
|
|
zio_handle_panic_injection(spa, FTAG, 1);
|
|
|
|
|
|
|
|
spa_activate(newspa, spa_mode_global);
|
|
|
|
spa_async_suspend(newspa);
|
|
|
|
|
2018-12-19 19:20:39 +03:00
|
|
|
/*
|
2019-03-29 19:13:20 +03:00
|
|
|
* Temporarily stop the initializing and TRIM activity. We set the
|
|
|
|
* state to ACTIVE so that we know to resume initializing or TRIM
|
|
|
|
* once the split has completed.
|
2018-12-19 19:20:39 +03:00
|
|
|
*/
|
2019-03-29 19:13:20 +03:00
|
|
|
list_t vd_initialize_list;
|
|
|
|
list_create(&vd_initialize_list, sizeof (vdev_t),
|
2018-12-19 19:20:39 +03:00
|
|
|
offsetof(vdev_t, vdev_initialize_node));
|
|
|
|
|
2019-03-29 19:13:20 +03:00
|
|
|
list_t vd_trim_list;
|
|
|
|
list_create(&vd_trim_list, sizeof (vdev_t),
|
|
|
|
offsetof(vdev_t, vdev_trim_node));
|
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
for (c = 0; c < children; c++) {
|
2020-05-06 20:32:28 +03:00
|
|
|
if (vml[c] != NULL && vml[c]->vdev_ops != &vdev_indirect_ops) {
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
mutex_enter(&vml[c]->vdev_initialize_lock);
|
2019-03-29 19:13:20 +03:00
|
|
|
vdev_initialize_stop(vml[c],
|
|
|
|
VDEV_INITIALIZE_ACTIVE, &vd_initialize_list);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
mutex_exit(&vml[c]->vdev_initialize_lock);
|
2019-03-29 19:13:20 +03:00
|
|
|
|
|
|
|
mutex_enter(&vml[c]->vdev_trim_lock);
|
|
|
|
vdev_trim_stop(vml[c], VDEV_TRIM_ACTIVE, &vd_trim_list);
|
|
|
|
mutex_exit(&vml[c]->vdev_trim_lock);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
}
|
|
|
|
}
|
2019-03-29 19:13:20 +03:00
|
|
|
|
|
|
|
vdev_initialize_stop_wait(spa, &vd_initialize_list);
|
|
|
|
vdev_trim_stop_wait(spa, &vd_trim_list);
|
|
|
|
|
|
|
|
list_destroy(&vd_initialize_list);
|
|
|
|
list_destroy(&vd_trim_list);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
newspa->spa_config_source = SPA_CONFIG_SRC_SPLIT;
|
2020-04-09 19:59:31 +03:00
|
|
|
newspa->spa_is_splitting = B_TRUE;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* create the new pool from the disks of the original pool */
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (error)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* if that worked, generate a real config for the new pool */
|
|
|
|
if (newspa->spa_root_vdev != NULL) {
|
2021-10-27 02:15:38 +03:00
|
|
|
newspa->spa_config_splitting = fnvlist_alloc();
|
|
|
|
fnvlist_add_uint64(newspa->spa_config_splitting,
|
|
|
|
ZPOOL_CONFIG_SPLIT_GUID, spa_guid(spa));
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_config_set(newspa, spa_config_generate(newspa, NULL, -1ULL,
|
|
|
|
B_TRUE));
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* set the props */
|
|
|
|
if (props != NULL) {
|
|
|
|
spa_configfile_set(newspa, props, B_FALSE);
|
|
|
|
error = spa_prop_set(newspa, props);
|
|
|
|
if (error)
|
|
|
|
goto out;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* flush everything */
|
|
|
|
txg = spa_vdev_config_enter(newspa);
|
|
|
|
vdev_config_dirty(newspa->spa_root_vdev);
|
|
|
|
(void) spa_vdev_config_exit(newspa, NULL, txg, 0, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zio_injection_enabled)
|
|
|
|
zio_handle_panic_injection(spa, FTAG, 2);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_async_resume(newspa);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* finally, update the original pool's config */
|
|
|
|
txg = spa_vdev_config_enter(spa);
|
|
|
|
tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
|
|
|
|
error = dmu_tx_assign(tx, TXG_WAIT);
|
|
|
|
if (error != 0)
|
|
|
|
dmu_tx_abort(tx);
|
|
|
|
for (c = 0; c < children; c++) {
|
2020-05-06 20:32:28 +03:00
|
|
|
if (vml[c] != NULL && vml[c]->vdev_ops != &vdev_indirect_ops) {
|
2019-03-22 23:11:36 +03:00
|
|
|
vdev_t *tvd = vml[c]->vdev_top;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Need to be sure the detachable VDEV is not
|
|
|
|
* on any *other* txg's DTL list to prevent it
|
|
|
|
* from being accessed after it's freed.
|
|
|
|
*/
|
|
|
|
for (int t = 0; t < TXG_SIZE; t++) {
|
|
|
|
(void) txg_list_remove_this(
|
|
|
|
&tvd->vdev_dtl_list, vml[c], t);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_split(vml[c]);
|
|
|
|
if (error == 0)
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal(spa, "detach", tx,
|
|
|
|
"vdev=%s", vml[c]->vdev_path);
|
2016-04-11 23:16:57 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_free(vml[c]);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
2016-04-11 23:16:57 +03:00
|
|
|
spa->spa_avz_action = AVZ_ACTION_REBUILD;
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_config_dirty(spa->spa_root_vdev);
|
|
|
|
spa->spa_config_splitting = NULL;
|
|
|
|
nvlist_free(nvl);
|
|
|
|
if (error == 0)
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
(void) spa_vdev_exit(spa, NULL, txg, 0);
|
|
|
|
|
|
|
|
if (zio_injection_enabled)
|
|
|
|
zio_handle_panic_injection(spa, FTAG, 3);
|
|
|
|
|
|
|
|
/* split is complete; log a history record */
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal(newspa, "split", NULL,
|
|
|
|
"from pool %s", spa_name(spa));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2020-04-09 19:59:31 +03:00
|
|
|
newspa->spa_is_splitting = B_FALSE;
|
2010-05-29 00:45:14 +04:00
|
|
|
kmem_free(vml, children * sizeof (vdev_t *));
|
|
|
|
|
|
|
|
/* if we're not going to mount the filesystems in userland, export */
|
|
|
|
if (exp)
|
|
|
|
error = spa_export_common(newname, POOL_STATE_EXPORTED, NULL,
|
|
|
|
B_FALSE, B_FALSE);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
out:
|
|
|
|
spa_unload(newspa);
|
|
|
|
spa_deactivate(newspa);
|
|
|
|
spa_remove(newspa);
|
|
|
|
|
|
|
|
txg = spa_vdev_config_enter(spa);
|
|
|
|
|
|
|
|
/* re-online all offlined disks */
|
|
|
|
for (c = 0; c < children; c++) {
|
|
|
|
if (vml[c] != NULL)
|
|
|
|
vml[c]->vdev_offline = B_FALSE;
|
|
|
|
}
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
|
2019-03-29 19:13:20 +03:00
|
|
|
/* restart initializing or trimming disks as necessary */
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
spa_async_request(spa, SPA_ASYNC_INITIALIZE_RESTART);
|
2019-03-29 19:13:20 +03:00
|
|
|
spa_async_request(spa, SPA_ASYNC_TRIM_RESTART);
|
|
|
|
spa_async_request(spa, SPA_ASYNC_AUTOTRIM_RESTART);
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_reopen(spa->spa_root_vdev);
|
|
|
|
|
|
|
|
nvlist_free(spa->spa_config_splitting);
|
|
|
|
spa->spa_config_splitting = NULL;
|
|
|
|
(void) spa_vdev_exit(spa, NULL, txg, error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
kmem_free(vml, children * sizeof (vdev_t *));
|
2008-11-20 23:01:55 +03:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find any device that's done replacing, or a vdev marked 'unspare' that's
|
2013-06-11 21:12:34 +04:00
|
|
|
* currently spared, so we can detach it.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static vdev_t *
|
|
|
|
spa_vdev_resilver_done_hunt(vdev_t *vd)
|
|
|
|
{
|
|
|
|
vdev_t *newvd, *oldvd;
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < vd->vdev_children; c++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
oldvd = spa_vdev_resilver_done_hunt(vd->vdev_child[c]);
|
|
|
|
if (oldvd != NULL)
|
|
|
|
return (oldvd);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-08-27 01:24:34 +04:00
|
|
|
* Check for a completed replacement. We always consider the first
|
|
|
|
* vdev in the list to be the oldest vdev, and the last one to be
|
|
|
|
* the newest (see spa_vdev_attach() for how that works). In
|
|
|
|
* the case where the newest vdev is faulted, we will not automatically
|
|
|
|
* remove it after a resilver completes. This is OK as it will require
|
|
|
|
* user intervention to determine which disk the admin wishes to keep.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
if (vd->vdev_ops == &vdev_replacing_ops) {
|
|
|
|
ASSERT(vd->vdev_children > 1);
|
|
|
|
|
|
|
|
newvd = vd->vdev_child[vd->vdev_children - 1];
|
2008-11-20 23:01:55 +03:00
|
|
|
oldvd = vd->vdev_child[0];
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
if (vdev_dtl_empty(newvd, DTL_MISSING) &&
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_dtl_empty(newvd, DTL_OUTAGE) &&
|
2009-01-16 00:59:39 +03:00
|
|
|
!vdev_dtl_required(oldvd))
|
2008-11-20 23:01:55 +03:00
|
|
|
return (oldvd);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check for a completed resilver with the 'unspare' flag set.
|
2018-08-30 16:13:30 +03:00
|
|
|
* Also potentially update faulted state.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
if (vd->vdev_ops == &vdev_spare_ops) {
|
|
|
|
vdev_t *first = vd->vdev_child[0];
|
|
|
|
vdev_t *last = vd->vdev_child[vd->vdev_children - 1];
|
|
|
|
|
|
|
|
if (last->vdev_unspare) {
|
|
|
|
oldvd = first;
|
|
|
|
newvd = last;
|
|
|
|
} else if (first->vdev_unspare) {
|
|
|
|
oldvd = last;
|
|
|
|
newvd = first;
|
|
|
|
} else {
|
|
|
|
oldvd = NULL;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
if (oldvd != NULL &&
|
2009-01-16 00:59:39 +03:00
|
|
|
vdev_dtl_empty(newvd, DTL_MISSING) &&
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_dtl_empty(newvd, DTL_OUTAGE) &&
|
2010-08-27 01:24:34 +04:00
|
|
|
!vdev_dtl_required(oldvd))
|
2008-11-20 23:01:55 +03:00
|
|
|
return (oldvd);
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2018-08-30 16:13:30 +03:00
|
|
|
vdev_propagate_state(vd);
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/*
|
|
|
|
* If there are more than two spares attached to a disk,
|
|
|
|
* and those spares are not required, then we want to
|
|
|
|
* attempt to free them up now so that they can be used
|
|
|
|
* by other pools. Once we're back down to a single
|
|
|
|
* disk+spare, we stop removing them.
|
|
|
|
*/
|
|
|
|
if (vd->vdev_children > 2) {
|
|
|
|
newvd = vd->vdev_child[1];
|
|
|
|
|
|
|
|
if (newvd->vdev_isspare && last->vdev_isspare &&
|
|
|
|
vdev_dtl_empty(last, DTL_MISSING) &&
|
|
|
|
vdev_dtl_empty(last, DTL_OUTAGE) &&
|
|
|
|
!vdev_dtl_required(newvd))
|
|
|
|
return (newvd);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_vdev_resilver_done(spa_t *spa)
|
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
vdev_t *vd, *pvd, *ppvd;
|
|
|
|
uint64_t guid, sguid, pguid, ppguid;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
while ((vd = spa_vdev_resilver_done_hunt(spa->spa_root_vdev)) != NULL) {
|
2009-01-16 00:59:39 +03:00
|
|
|
pvd = vd->vdev_parent;
|
|
|
|
ppvd = pvd->vdev_parent;
|
2008-11-20 23:01:55 +03:00
|
|
|
guid = vd->vdev_guid;
|
2009-01-16 00:59:39 +03:00
|
|
|
pguid = pvd->vdev_guid;
|
|
|
|
ppguid = ppvd->vdev_guid;
|
|
|
|
sguid = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* If we have just finished replacing a hot spared device, then
|
|
|
|
* we need to detach the parent's first child (the original hot
|
|
|
|
* spare) as well.
|
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
if (ppvd->vdev_ops == &vdev_spare_ops && pvd->vdev_id == 0 &&
|
|
|
|
ppvd->vdev_children == 2) {
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(pvd->vdev_ops == &vdev_replacing_ops);
|
2009-01-16 00:59:39 +03:00
|
|
|
sguid = ppvd->vdev_child[1]->vdev_guid;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2013-08-08 00:16:22 +04:00
|
|
|
ASSERT(vd->vdev_resilver_txg == 0 || !vdev_dtl_required(vd));
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
|
|
|
if (spa_vdev_detach(spa, guid, pguid, B_TRUE) != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
2009-01-16 00:59:39 +03:00
|
|
|
if (sguid && spa_vdev_detach(spa, sguid, ppguid, B_TRUE) != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
2009-01-16 00:59:39 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2020-07-03 21:05:50 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If a detach was not performed above replace waiters will not have
|
|
|
|
* been notified. In which case we must do so now.
|
|
|
|
*/
|
|
|
|
spa_notify_waiters(spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Update the stored path or FRU for this vdev.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static int
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_vdev_set_common(spa_t *spa, uint64_t guid, const char *value,
|
|
|
|
boolean_t ispath)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
vdev_t *vd;
|
2010-05-29 00:45:14 +04:00
|
|
|
boolean_t sync = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
ASSERT(spa_writeable(spa));
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_vdev_state_enter(spa, SCL_ALL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if ((vd = spa_lookup_by_guid(spa, guid, B_TRUE)) == NULL)
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_state_exit(spa, NULL, ENOENT));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (!vd->vdev_ops->vdev_op_leaf)
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_state_exit(spa, NULL, ENOTSUP));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (ispath) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (strcmp(value, vd->vdev_path) != 0) {
|
|
|
|
spa_strfree(vd->vdev_path);
|
|
|
|
vd->vdev_path = spa_strdup(value);
|
|
|
|
sync = B_TRUE;
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
} else {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (vd->vdev_fru == NULL) {
|
|
|
|
vd->vdev_fru = spa_strdup(value);
|
|
|
|
sync = B_TRUE;
|
|
|
|
} else if (strcmp(value, vd->vdev_fru) != 0) {
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_strfree(vd->vdev_fru);
|
2010-05-29 00:45:14 +04:00
|
|
|
vd->vdev_fru = spa_strdup(value);
|
|
|
|
sync = B_TRUE;
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_vdev_state_exit(spa, sync ? vd : NULL, 0));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
int
|
|
|
|
spa_vdev_setpath(spa_t *spa, uint64_t guid, const char *newpath)
|
|
|
|
{
|
|
|
|
return (spa_vdev_set_common(spa, guid, newpath, B_TRUE));
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
spa_vdev_setfru(spa_t *spa, uint64_t guid, const char *newfru)
|
|
|
|
{
|
|
|
|
return (spa_vdev_set_common(spa, guid, newfru, B_FALSE));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* ==========================================================================
|
2010-05-29 00:45:14 +04:00
|
|
|
* SPA Scanning
|
2008-11-20 23:01:55 +03:00
|
|
|
* ==========================================================================
|
|
|
|
*/
|
2017-07-07 08:16:13 +03:00
|
|
|
int
|
|
|
|
spa_scrub_pause_resume(spa_t *spa, pool_scrub_cmd_t cmd)
|
|
|
|
{
|
|
|
|
ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
|
|
|
|
|
|
|
|
if (dsl_scan_resilvering(spa->spa_dsl_pool))
|
|
|
|
return (SET_ERROR(EBUSY));
|
|
|
|
|
|
|
|
return (dsl_scrub_set_pause_resume(spa->spa_dsl_pool, cmd));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
int
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_scan_stop(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
|
|
|
|
if (dsl_scan_resilvering(spa->spa_dsl_pool))
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EBUSY));
|
2010-05-29 00:45:14 +04:00
|
|
|
return (dsl_scan_cancel(spa->spa_dsl_pool));
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
spa_scan(spa_t *spa, pool_scan_func_t func)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (func >= POOL_SCAN_FUNCS || func == POOL_SCAN_NONE)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOTSUP));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-05-03 02:42:31 +03:00
|
|
|
if (func == POOL_SCAN_RESILVER &&
|
|
|
|
!spa_feature_is_enabled(spa, SPA_FEATURE_RESILVER_DEFER))
|
|
|
|
return (SET_ERROR(ENOTSUP));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* If a resilver was requested, but there is no DTL on a
|
|
|
|
* writeable leaf device, we have nothing to do.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (func == POOL_SCAN_RESILVER &&
|
2008-12-03 23:09:06 +03:00
|
|
|
!vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL)) {
|
|
|
|
spa_async_request(spa, SPA_ASYNC_RESILVER_DONE);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return (dsl_scan(spa->spa_dsl_pool, func));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* SPA async task processing
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_async_remove(spa_t *spa, vdev_t *vd)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
if (vd->vdev_remove_wanted) {
|
2010-05-29 00:45:14 +04:00
|
|
|
vd->vdev_remove_wanted = B_FALSE;
|
|
|
|
vd->vdev_delayed_close = B_FALSE;
|
2008-12-03 23:09:06 +03:00
|
|
|
vdev_set_state(vd, B_FALSE, VDEV_STATE_REMOVED, VDEV_AUX_NONE);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We want to clear the stats, but we don't want to do a full
|
|
|
|
* vdev_clear() as that will cause us to throw away
|
|
|
|
* degraded/faulted state as well as attempt to reopen the
|
|
|
|
* device, all of which is a waste.
|
|
|
|
*/
|
|
|
|
vd->vdev_stat.vs_read_errors = 0;
|
|
|
|
vd->vdev_stat.vs_write_errors = 0;
|
|
|
|
vd->vdev_stat.vs_checksum_errors = 0;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
vdev_state_dirty(vd->vdev_top);
|
2020-12-02 21:20:02 +03:00
|
|
|
|
|
|
|
/* Tell userspace that the vdev is gone. */
|
|
|
|
zfs_post_remove(spa, vd);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < vd->vdev_children; c++)
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_async_remove(spa, vd->vdev_child[c]);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_async_probe(spa_t *spa, vdev_t *vd)
|
|
|
|
{
|
|
|
|
if (vd->vdev_probe_wanted) {
|
2010-05-29 00:45:14 +04:00
|
|
|
vd->vdev_probe_wanted = B_FALSE;
|
2008-12-03 23:09:06 +03:00
|
|
|
vdev_reopen(vd); /* vdev_open() does the actual probe */
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < vd->vdev_children; c++)
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_async_probe(spa, vd->vdev_child[c]);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
static void
|
|
|
|
spa_async_autoexpand(spa_t *spa, vdev_t *vd)
|
|
|
|
{
|
|
|
|
if (!spa->spa_autoexpand)
|
|
|
|
return;
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < vd->vdev_children; c++) {
|
2009-07-03 02:44:48 +04:00
|
|
|
vdev_t *cvd = vd->vdev_child[c];
|
|
|
|
spa_async_autoexpand(spa, cvd);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!vd->vdev_ops->vdev_op_leaf || vd->vdev_physpath == NULL)
|
|
|
|
return;
|
|
|
|
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(vd->vdev_spa, vd, NULL, ESC_ZFS_VDEV_AUTOEXPAND);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((noreturn)) void
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
spa_async_thread(void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
spa_t *spa = (spa_t *)arg;
|
2018-10-19 07:06:18 +03:00
|
|
|
dsl_pool_t *dp = spa->spa_dsl_pool;
|
2017-10-27 22:46:35 +03:00
|
|
|
int tasks;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(spa->spa_sync_on);
|
|
|
|
|
|
|
|
mutex_enter(&spa->spa_async_lock);
|
|
|
|
tasks = spa->spa_async_tasks;
|
|
|
|
spa->spa_async_tasks = 0;
|
|
|
|
mutex_exit(&spa->spa_async_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* See if the config needs to be updated.
|
|
|
|
*/
|
|
|
|
if (tasks & SPA_ASYNC_CONFIG_UPDATE) {
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t old_space, new_space;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&spa_namespace_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
old_space = metaslab_class_get_space(spa_normal_class(spa));
|
2018-09-06 04:33:36 +03:00
|
|
|
old_space += metaslab_class_get_space(spa_special_class(spa));
|
|
|
|
old_space += metaslab_class_get_space(spa_dedup_class(spa));
|
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-22 02:12:54 +03:00
|
|
|
old_space += metaslab_class_get_space(
|
|
|
|
spa_embedded_log_class(spa));
|
2018-09-06 04:33:36 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
new_space = metaslab_class_get_space(spa_normal_class(spa));
|
2018-09-06 04:33:36 +03:00
|
|
|
new_space += metaslab_class_get_space(spa_special_class(spa));
|
|
|
|
new_space += metaslab_class_get_space(spa_dedup_class(spa));
|
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-22 02:12:54 +03:00
|
|
|
new_space += metaslab_class_get_space(
|
|
|
|
spa_embedded_log_class(spa));
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the pool grew as a result of the config update,
|
|
|
|
* then log an internal history event.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (new_space != old_space) {
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal(spa, "vdev online", NULL,
|
2009-08-18 22:43:27 +04:00
|
|
|
"pool '%s' size: %llu(+%llu)",
|
2019-09-12 23:28:26 +03:00
|
|
|
spa_name(spa), (u_longlong_t)new_space,
|
|
|
|
(u_longlong_t)(new_space - old_space));
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* See if any devices need to be marked REMOVED.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (tasks & SPA_ASYNC_REMOVE) {
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_vdev_state_enter(spa, SCL_NONE);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa_async_remove(spa, spa->spa_root_vdev);
|
2017-10-27 22:46:35 +03:00
|
|
|
for (int i = 0; i < spa->spa_l2cache.sav_count; i++)
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_async_remove(spa, spa->spa_l2cache.sav_vdevs[i]);
|
2017-10-27 22:46:35 +03:00
|
|
|
for (int i = 0; i < spa->spa_spares.sav_count; i++)
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_async_remove(spa, spa->spa_spares.sav_vdevs[i]);
|
|
|
|
(void) spa_vdev_state_exit(spa, NULL, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if ((tasks & SPA_ASYNC_AUTOEXPAND) && !spa_suspended(spa)) {
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
spa_async_autoexpand(spa, spa->spa_root_vdev);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* See if any devices need to be probed.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (tasks & SPA_ASYNC_PROBE) {
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_vdev_state_enter(spa, SCL_NONE);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_async_probe(spa, spa->spa_root_vdev);
|
|
|
|
(void) spa_vdev_state_exit(spa, NULL, 0);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* If any devices are done replacing, detach them.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
if (tasks & SPA_ASYNC_RESILVER_DONE ||
|
|
|
|
tasks & SPA_ASYNC_REBUILD_DONE) {
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_vdev_resilver_done(spa);
|
2020-07-03 21:05:50 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Kick off a resilver.
|
|
|
|
*/
|
2018-10-19 07:06:18 +03:00
|
|
|
if (tasks & SPA_ASYNC_RESILVER &&
|
2020-07-03 21:05:50 +03:00
|
|
|
!vdev_rebuild_active(spa->spa_root_vdev) &&
|
2018-10-19 07:06:18 +03:00
|
|
|
(!dsl_scan_resilvering(dp) ||
|
|
|
|
!spa_feature_is_enabled(dp->dp_spa, SPA_FEATURE_RESILVER_DEFER)))
|
2019-11-27 21:15:01 +03:00
|
|
|
dsl_scan_restart_resilver(dp, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
if (tasks & SPA_ASYNC_INITIALIZE_RESTART) {
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
vdev_initialize_restart(spa->spa_root_vdev);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
2019-03-29 19:13:20 +03:00
|
|
|
if (tasks & SPA_ASYNC_TRIM_RESTART) {
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
vdev_trim_restart(spa->spa_root_vdev);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (tasks & SPA_ASYNC_AUTOTRIM_RESTART) {
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
vdev_autotrim_restart(spa);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
2020-06-09 20:15:08 +03:00
|
|
|
/*
|
|
|
|
* Kick off L2 cache whole device TRIM.
|
|
|
|
*/
|
|
|
|
if (tasks & SPA_ASYNC_L2CACHE_TRIM) {
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
vdev_trim_l2arc(spa);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* Kick off L2 cache rebuilding.
|
|
|
|
*/
|
|
|
|
if (tasks & SPA_ASYNC_L2CACHE_REBUILD) {
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_config_enter(spa, SCL_L2ARC, FTAG, RW_READER);
|
|
|
|
l2arc_spa_rebuild_start(spa);
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, FTAG);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Let the world know that we're done.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa->spa_async_lock);
|
|
|
|
spa->spa_async_thread = NULL;
|
|
|
|
cv_broadcast(&spa->spa_async_cv);
|
|
|
|
mutex_exit(&spa->spa_async_lock);
|
|
|
|
thread_exit();
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_async_suspend(spa_t *spa)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa->spa_async_lock);
|
|
|
|
spa->spa_async_suspended++;
|
OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in
spa_condense_indirect_thread(), so it calls the
spa_condense_indirect_complete_sync() sync task which sets
the spa_condensing_indirect field to NULL. Waiting for the
sync task to finish, thread A sleeps until the txg is done.
When this happens, thread A will acquire spa_async_lock and
set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is
running spa_sync() checks whether it should condense the
second vdev in vdev_indirect_should_condense() by checking the
spa_condensing_indirect field which was set to NULL by
spa_condense_indirect_thread() from thread A. So it goes on
and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned
assertions fails because thread A has not set spa_condense_thread
to NULL (which is basically the last thing it does before returning).
The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.
This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.
In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.
The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
currently condensing or not
- condensing is more decoupled from the spa_async_thread related
functionality.
As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2017-03-16 02:41:52 +03:00
|
|
|
while (spa->spa_async_thread != NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
cv_wait(&spa->spa_async_cv, &spa->spa_async_lock);
|
|
|
|
mutex_exit(&spa->spa_async_lock);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
spa_vdev_remove_suspend(spa);
|
OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in
spa_condense_indirect_thread(), so it calls the
spa_condense_indirect_complete_sync() sync task which sets
the spa_condensing_indirect field to NULL. Waiting for the
sync task to finish, thread A sleeps until the txg is done.
When this happens, thread A will acquire spa_async_lock and
set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is
running spa_sync() checks whether it should condense the
second vdev in vdev_indirect_should_condense() by checking the
spa_condensing_indirect field which was set to NULL by
spa_condense_indirect_thread() from thread A. So it goes on
and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned
assertions fails because thread A has not set spa_condense_thread
to NULL (which is basically the last thing it does before returning).
The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.
This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.
In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.
The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
currently condensing or not
- condensing is more decoupled from the spa_async_thread related
functionality.
As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2017-03-16 02:41:52 +03:00
|
|
|
|
|
|
|
zthr_t *condense_thread = spa->spa_condense_zthr;
|
2019-01-13 21:09:46 +03:00
|
|
|
if (condense_thread != NULL)
|
|
|
|
zthr_cancel(condense_thread);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
zthr_t *discard_thread = spa->spa_checkpoint_discard_zthr;
|
2019-01-13 21:09:46 +03:00
|
|
|
if (discard_thread != NULL)
|
|
|
|
zthr_cancel(discard_thread);
|
2019-07-26 20:54:14 +03:00
|
|
|
|
|
|
|
zthr_t *ll_delete_thread = spa->spa_livelist_delete_zthr;
|
|
|
|
if (ll_delete_thread != NULL)
|
|
|
|
zthr_cancel(ll_delete_thread);
|
|
|
|
|
|
|
|
zthr_t *ll_condense_thread = spa->spa_livelist_condense_zthr;
|
|
|
|
if (ll_condense_thread != NULL)
|
|
|
|
zthr_cancel(ll_condense_thread);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_async_resume(spa_t *spa)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa->spa_async_lock);
|
|
|
|
ASSERT(spa->spa_async_suspended != 0);
|
|
|
|
spa->spa_async_suspended--;
|
|
|
|
mutex_exit(&spa->spa_async_lock);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
spa_restart_removal(spa);
|
OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in
spa_condense_indirect_thread(), so it calls the
spa_condense_indirect_complete_sync() sync task which sets
the spa_condensing_indirect field to NULL. Waiting for the
sync task to finish, thread A sleeps until the txg is done.
When this happens, thread A will acquire spa_async_lock and
set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is
running spa_sync() checks whether it should condense the
second vdev in vdev_indirect_should_condense() by checking the
spa_condensing_indirect field which was set to NULL by
spa_condense_indirect_thread() from thread A. So it goes on
and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned
assertions fails because thread A has not set spa_condense_thread
to NULL (which is basically the last thing it does before returning).
The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.
This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.
In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.
The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
currently condensing or not
- condensing is more decoupled from the spa_async_thread related
functionality.
As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2017-03-16 02:41:52 +03:00
|
|
|
|
|
|
|
zthr_t *condense_thread = spa->spa_condense_zthr;
|
2019-01-13 21:09:46 +03:00
|
|
|
if (condense_thread != NULL)
|
OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in
spa_condense_indirect_thread(), so it calls the
spa_condense_indirect_complete_sync() sync task which sets
the spa_condensing_indirect field to NULL. Waiting for the
sync task to finish, thread A sleeps until the txg is done.
When this happens, thread A will acquire spa_async_lock and
set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is
running spa_sync() checks whether it should condense the
second vdev in vdev_indirect_should_condense() by checking the
spa_condensing_indirect field which was set to NULL by
spa_condense_indirect_thread() from thread A. So it goes on
and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned
assertions fails because thread A has not set spa_condense_thread
to NULL (which is basically the last thing it does before returning).
The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.
This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.
In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.
The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
currently condensing or not
- condensing is more decoupled from the spa_async_thread related
functionality.
As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2017-03-16 02:41:52 +03:00
|
|
|
zthr_resume(condense_thread);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
zthr_t *discard_thread = spa->spa_checkpoint_discard_zthr;
|
2019-01-13 21:09:46 +03:00
|
|
|
if (discard_thread != NULL)
|
2016-12-17 01:11:29 +03:00
|
|
|
zthr_resume(discard_thread);
|
2019-07-26 20:54:14 +03:00
|
|
|
|
|
|
|
zthr_t *ll_delete_thread = spa->spa_livelist_delete_zthr;
|
|
|
|
if (ll_delete_thread != NULL)
|
|
|
|
zthr_resume(ll_delete_thread);
|
|
|
|
|
|
|
|
zthr_t *ll_condense_thread = spa->spa_livelist_condense_zthr;
|
|
|
|
if (ll_condense_thread != NULL)
|
|
|
|
zthr_resume(ll_condense_thread);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-12-31 19:38:59 +03:00
|
|
|
static boolean_t
|
|
|
|
spa_async_tasks_pending(spa_t *spa)
|
|
|
|
{
|
|
|
|
uint_t non_config_tasks;
|
|
|
|
uint_t config_task;
|
|
|
|
boolean_t config_task_suspended;
|
|
|
|
|
|
|
|
non_config_tasks = spa->spa_async_tasks & ~SPA_ASYNC_CONFIG_UPDATE;
|
|
|
|
config_task = spa->spa_async_tasks & SPA_ASYNC_CONFIG_UPDATE;
|
|
|
|
if (spa->spa_ccw_fail_time == 0) {
|
|
|
|
config_task_suspended = B_FALSE;
|
|
|
|
} else {
|
|
|
|
config_task_suspended =
|
|
|
|
(gethrtime() - spa->spa_ccw_fail_time) <
|
2016-10-14 00:25:05 +03:00
|
|
|
((hrtime_t)zfs_ccw_retry_interval * NANOSEC);
|
2015-12-31 19:38:59 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
return (non_config_tasks || (config_task && !config_task_suspended));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
spa_async_dispatch(spa_t *spa)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa->spa_async_lock);
|
2015-12-31 19:38:59 +03:00
|
|
|
if (spa_async_tasks_pending(spa) &&
|
|
|
|
!spa->spa_async_suspended &&
|
2019-11-21 20:32:57 +03:00
|
|
|
spa->spa_async_thread == NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_async_thread = thread_create(NULL, 0,
|
|
|
|
spa_async_thread, spa, 0, &p0, TS_RUN, maxclsyspri);
|
|
|
|
mutex_exit(&spa->spa_async_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_async_request(spa_t *spa, int task)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
zfs_dbgmsg("spa=%s async request task=%u", spa->spa_name, task);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&spa->spa_async_lock);
|
|
|
|
spa->spa_async_tasks |= task;
|
|
|
|
mutex_exit(&spa->spa_async_lock);
|
|
|
|
}
|
|
|
|
|
2019-11-27 21:15:01 +03:00
|
|
|
int
|
|
|
|
spa_async_tasks(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_async_tasks);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* SPA syncing routines
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
2019-07-26 20:54:14 +03:00
|
|
|
bpobj_enqueue_cb(void *arg, const blkptr_t *bp, boolean_t bp_freed,
|
|
|
|
dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
bpobj_t *bpo = arg;
|
2019-07-26 20:54:14 +03:00
|
|
|
bpobj_enqueue(bpo, bp, bp_freed, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (0);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
int
|
|
|
|
bpobj_enqueue_alloc_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
return (bpobj_enqueue_cb(arg, bp, B_FALSE, tx));
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
bpobj_enqueue_free_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
return (bpobj_enqueue_cb(arg, bp, B_TRUE, tx));
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static int
|
|
|
|
spa_free_sync_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
|
|
|
|
{
|
2020-02-29 01:49:44 +03:00
|
|
|
zio_t *pio = arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-02-29 01:49:44 +03:00
|
|
|
zio_nowait(zio_free_sync(pio, pio->io_spa, dmu_tx_get_txg(tx), bp,
|
|
|
|
pio->io_flags));
|
2010-05-29 00:45:14 +04:00
|
|
|
return (0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
static int
|
|
|
|
bpobj_spa_free_sync_cb(void *arg, const blkptr_t *bp, boolean_t bp_freed,
|
|
|
|
dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
ASSERT(!bp_freed);
|
|
|
|
return (spa_free_sync_cb(arg, bp, tx));
|
|
|
|
}
|
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
/*
|
|
|
|
* Note: this simple function is not inlined to make it easier to dtrace the
|
|
|
|
* amount of time spent syncing frees.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_sync_frees(spa_t *spa, bplist_t *bpl, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
zio_t *zio = zio_root(spa, NULL, NULL, 0);
|
|
|
|
bplist_iterate(bpl, spa_free_sync_cb, zio, tx);
|
|
|
|
VERIFY(zio_wait(zio) == 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note: this simple function is not inlined to make it easier to dtrace the
|
|
|
|
* amount of time spent syncing deferred frees.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_sync_deferred_frees(spa_t *spa, dmu_tx_t *tx)
|
|
|
|
{
|
2019-01-18 20:50:16 +03:00
|
|
|
if (spa_sync_pass(spa) != 1)
|
|
|
|
return;
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
/*
|
|
|
|
* Note:
|
|
|
|
* If the log space map feature is active, we stop deferring
|
|
|
|
* frees to the next TXG and therefore running this function
|
|
|
|
* would be considered a no-op as spa_deferred_bpobj should
|
|
|
|
* not have any entries.
|
|
|
|
*
|
|
|
|
* That said we run this function anyway (instead of returning
|
|
|
|
* immediately) for the edge-case scenario where we just
|
|
|
|
* activated the log space map feature in this TXG but we have
|
|
|
|
* deferred frees from the previous TXG.
|
|
|
|
*/
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
zio_t *zio = zio_root(spa, NULL, NULL, 0);
|
|
|
|
VERIFY3U(bpobj_iterate(&spa->spa_deferred_bpobj,
|
2019-07-26 20:54:14 +03:00
|
|
|
bpobj_spa_free_sync_cb, zio, tx), ==, 0);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
VERIFY0(zio_wait(zio));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
spa_sync_nvlist(spa_t *spa, uint64_t obj, nvlist_t *nv, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
char *packed = NULL;
|
2008-12-03 23:09:06 +03:00
|
|
|
size_t bufsize;
|
2008-11-20 23:01:55 +03:00
|
|
|
size_t nvsize = 0;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
|
|
|
|
VERIFY(nvlist_size(nv, &nvsize, NV_ENCODE_XDR) == 0);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Write full (SPA_CONFIG_BLOCKSIZE) blocks of configuration
|
2013-12-09 22:37:51 +04:00
|
|
|
* information. This avoids the dmu_buf_will_dirty() path and
|
2008-12-03 23:09:06 +03:00
|
|
|
* saves us a pre-read to get data we don't actually care about.
|
|
|
|
*/
|
2012-12-14 03:24:15 +04:00
|
|
|
bufsize = P2ROUNDUP((uint64_t)nvsize, SPA_CONFIG_BLOCKSIZE);
|
2014-11-21 03:09:39 +03:00
|
|
|
packed = vmem_alloc(bufsize, KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
VERIFY(nvlist_pack(nv, &packed, &nvsize, NV_ENCODE_XDR,
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP) == 0);
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(packed + nvsize, 0, bufsize - nvsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
dmu_write(spa->spa_meta_objset, obj, 0, bufsize, packed, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 22:46:09 +04:00
|
|
|
vmem_free(packed, bufsize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
VERIFY(0 == dmu_bonus_hold(spa->spa_meta_objset, obj, FTAG, &db));
|
|
|
|
dmu_buf_will_dirty(db, tx);
|
|
|
|
*(uint64_t *)db->db_data = nvsize;
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_sync_aux_dev(spa_t *spa, spa_aux_vdev_t *sav, dmu_tx_t *tx,
|
|
|
|
const char *config, const char *entry)
|
|
|
|
{
|
|
|
|
nvlist_t *nvroot;
|
|
|
|
nvlist_t **list;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (!sav->sav_sync)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Update the MOS nvlist describing the list of available devices.
|
|
|
|
* spa_validate_aux() will have already made sure this nvlist is
|
|
|
|
* valid and the vdevs are labeled appropriately.
|
|
|
|
*/
|
|
|
|
if (sav->sav_object == 0) {
|
|
|
|
sav->sav_object = dmu_object_alloc(spa->spa_meta_objset,
|
|
|
|
DMU_OT_PACKED_NVLIST, 1 << 14, DMU_OT_PACKED_NVLIST_SIZE,
|
|
|
|
sizeof (uint64_t), tx);
|
|
|
|
VERIFY(zap_update(spa->spa_meta_objset,
|
|
|
|
DMU_POOL_DIRECTORY_OBJECT, entry, sizeof (uint64_t), 1,
|
|
|
|
&sav->sav_object, tx) == 0);
|
|
|
|
}
|
|
|
|
|
2021-10-27 02:15:38 +03:00
|
|
|
nvroot = fnvlist_alloc();
|
2008-11-20 23:01:55 +03:00
|
|
|
if (sav->sav_count == 0) {
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(nvroot, config,
|
|
|
|
(const nvlist_t * const *)NULL, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2014-11-21 03:09:39 +03:00
|
|
|
list = kmem_alloc(sav->sav_count*sizeof (void *), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < sav->sav_count; i++)
|
|
|
|
list[i] = vdev_config_generate(spa, sav->sav_vdevs[i],
|
2010-05-29 00:45:14 +04:00
|
|
|
B_FALSE, VDEV_CONFIG_L2CACHE);
|
2021-12-07 04:19:13 +03:00
|
|
|
fnvlist_add_nvlist_array(nvroot, config,
|
|
|
|
(const nvlist_t * const *)list, sav->sav_count);
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < sav->sav_count; i++)
|
|
|
|
nvlist_free(list[i]);
|
|
|
|
kmem_free(list, sav->sav_count * sizeof (void *));
|
|
|
|
}
|
|
|
|
|
|
|
|
spa_sync_nvlist(spa, sav->sav_object, nvroot, tx);
|
|
|
|
nvlist_free(nvroot);
|
|
|
|
|
|
|
|
sav->sav_sync = B_FALSE;
|
|
|
|
}
|
|
|
|
|
2016-04-11 23:16:57 +03:00
|
|
|
/*
|
|
|
|
* Rebuild spa's all-vdev ZAP from the vdev ZAPs indicated in each vdev_t.
|
|
|
|
* The all-vdev ZAP must be empty.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_avz_build(vdev_t *vd, uint64_t avz, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
spa_t *spa = vd->vdev_spa;
|
|
|
|
|
|
|
|
if (vd->vdev_top_zap != 0) {
|
|
|
|
VERIFY0(zap_add_int(spa->spa_meta_objset, avz,
|
|
|
|
vd->vdev_top_zap, tx));
|
|
|
|
}
|
|
|
|
if (vd->vdev_leaf_zap != 0) {
|
|
|
|
VERIFY0(zap_add_int(spa->spa_meta_objset, avz,
|
|
|
|
vd->vdev_leaf_zap, tx));
|
|
|
|
}
|
2017-11-04 23:25:13 +03:00
|
|
|
for (uint64_t i = 0; i < vd->vdev_children; i++) {
|
2016-04-11 23:16:57 +03:00
|
|
|
spa_avz_build(vd->vdev_child[i], avz, tx);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
spa_sync_config_object(spa_t *spa, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
nvlist_t *config;
|
|
|
|
|
2016-04-11 23:16:57 +03:00
|
|
|
/*
|
|
|
|
* If the pool is being imported from a pre-per-vdev-ZAP version of ZFS,
|
|
|
|
* its config may not be dirty but we still need to build per-vdev ZAPs.
|
|
|
|
* Similarly, if the pool is being assembled (e.g. after a split), we
|
|
|
|
* need to rebuild the AVZ although the config may not be dirty.
|
|
|
|
*/
|
|
|
|
if (list_is_empty(&spa->spa_config_dirty_list) &&
|
|
|
|
spa->spa_avz_action == AVZ_ACTION_NONE)
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
|
|
|
|
|
2016-04-11 23:16:57 +03:00
|
|
|
ASSERT(spa->spa_avz_action == AVZ_ACTION_NONE ||
|
2017-01-14 00:50:22 +03:00
|
|
|
spa->spa_avz_action == AVZ_ACTION_INITIALIZE ||
|
2016-04-11 23:16:57 +03:00
|
|
|
spa->spa_all_vdev_zaps != 0);
|
|
|
|
|
|
|
|
if (spa->spa_avz_action == AVZ_ACTION_REBUILD) {
|
|
|
|
/* Make and build the new AVZ */
|
|
|
|
uint64_t new_avz = zap_create(spa->spa_meta_objset,
|
|
|
|
DMU_OTN_ZAP_METADATA, DMU_OT_NONE, 0, tx);
|
|
|
|
spa_avz_build(spa->spa_root_vdev, new_avz, tx);
|
|
|
|
|
|
|
|
/* Diff old AVZ with new one */
|
2017-11-04 23:25:13 +03:00
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t za;
|
|
|
|
|
2016-04-11 23:16:57 +03:00
|
|
|
for (zap_cursor_init(&zc, spa->spa_meta_objset,
|
|
|
|
spa->spa_all_vdev_zaps);
|
|
|
|
zap_cursor_retrieve(&zc, &za) == 0;
|
|
|
|
zap_cursor_advance(&zc)) {
|
|
|
|
uint64_t vdzap = za.za_first_integer;
|
|
|
|
if (zap_lookup_int(spa->spa_meta_objset, new_avz,
|
|
|
|
vdzap) == ENOENT) {
|
|
|
|
/*
|
|
|
|
* ZAP is listed in old AVZ but not in new one;
|
|
|
|
* destroy it
|
|
|
|
*/
|
|
|
|
VERIFY0(zap_destroy(spa->spa_meta_objset, vdzap,
|
|
|
|
tx));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
|
|
|
|
/* Destroy the old AVZ */
|
|
|
|
VERIFY0(zap_destroy(spa->spa_meta_objset,
|
|
|
|
spa->spa_all_vdev_zaps, tx));
|
|
|
|
|
|
|
|
/* Replace the old AVZ in the dir obj with the new one */
|
|
|
|
VERIFY0(zap_update(spa->spa_meta_objset,
|
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_VDEV_ZAP_MAP,
|
|
|
|
sizeof (new_avz), 1, &new_avz, tx));
|
|
|
|
|
|
|
|
spa->spa_all_vdev_zaps = new_avz;
|
|
|
|
} else if (spa->spa_avz_action == AVZ_ACTION_DESTROY) {
|
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t za;
|
|
|
|
|
|
|
|
/* Walk through the AVZ and destroy all listed ZAPs */
|
|
|
|
for (zap_cursor_init(&zc, spa->spa_meta_objset,
|
|
|
|
spa->spa_all_vdev_zaps);
|
|
|
|
zap_cursor_retrieve(&zc, &za) == 0;
|
|
|
|
zap_cursor_advance(&zc)) {
|
|
|
|
uint64_t zap = za.za_first_integer;
|
|
|
|
VERIFY0(zap_destroy(spa->spa_meta_objset, zap, tx));
|
|
|
|
}
|
|
|
|
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
|
|
|
|
/* Destroy and unlink the AVZ itself */
|
|
|
|
VERIFY0(zap_destroy(spa->spa_meta_objset,
|
|
|
|
spa->spa_all_vdev_zaps, tx));
|
|
|
|
VERIFY0(zap_remove(spa->spa_meta_objset,
|
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_VDEV_ZAP_MAP, tx));
|
|
|
|
spa->spa_all_vdev_zaps = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa->spa_all_vdev_zaps == 0) {
|
|
|
|
spa->spa_all_vdev_zaps = zap_create_link(spa->spa_meta_objset,
|
|
|
|
DMU_OTN_ZAP_METADATA, DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
DMU_POOL_VDEV_ZAP_MAP, tx);
|
|
|
|
}
|
|
|
|
spa->spa_avz_action = AVZ_ACTION_NONE;
|
|
|
|
|
|
|
|
/* Create ZAPs for vdevs that don't have them. */
|
|
|
|
vdev_construct_zaps(spa->spa_root_vdev, tx);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
config = spa_config_generate(spa, spa->spa_root_vdev,
|
|
|
|
dmu_tx_get_txg(tx), B_FALSE);
|
|
|
|
|
2012-12-15 04:28:49 +04:00
|
|
|
/*
|
|
|
|
* If we're upgrading the spa version then make sure that
|
|
|
|
* the config object gets updated with the correct version.
|
|
|
|
*/
|
|
|
|
if (spa->spa_ubsync.ub_version < spa->spa_uberblock.ub_version)
|
|
|
|
fnvlist_add_uint64(config, ZPOOL_CONFIG_VERSION,
|
|
|
|
spa->spa_uberblock.ub_version);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-04-01 06:54:07 +03:00
|
|
|
nvlist_free(spa->spa_config_syncing);
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_config_syncing = config;
|
|
|
|
|
|
|
|
spa_sync_nvlist(spa, spa->spa_config_object, config, tx);
|
|
|
|
}
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
static void
|
2013-09-04 16:00:57 +04:00
|
|
|
spa_sync_version(void *arg, dmu_tx_t *tx)
|
2012-12-14 03:24:15 +04:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
uint64_t *versionp = arg;
|
|
|
|
uint64_t version = *versionp;
|
|
|
|
spa_t *spa = dmu_tx_pool(tx)->dp_spa;
|
2012-12-14 03:24:15 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Setting the version is special cased when first creating the pool.
|
|
|
|
*/
|
|
|
|
ASSERT(tx->tx_txg != TXG_INITIAL);
|
|
|
|
|
2013-02-10 04:25:55 +04:00
|
|
|
ASSERT(SPA_VERSION_IS_SUPPORTED(version));
|
2012-12-14 03:24:15 +04:00
|
|
|
ASSERT(version >= spa_version(spa));
|
|
|
|
|
|
|
|
spa->spa_uberblock.ub_version = version;
|
|
|
|
vdev_config_dirty(spa->spa_root_vdev);
|
2019-09-12 23:28:26 +03:00
|
|
|
spa_history_log_internal(spa, "set", tx, "version=%lld",
|
|
|
|
(longlong_t)version);
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Set zpool properties.
|
|
|
|
*/
|
|
|
|
static void
|
2013-09-04 16:00:57 +04:00
|
|
|
spa_sync_props(void *arg, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
nvlist_t *nvp = arg;
|
|
|
|
spa_t *spa = dmu_tx_pool(tx)->dp_spa;
|
2008-11-20 23:01:55 +03:00
|
|
|
objset_t *mos = spa->spa_meta_objset;
|
2012-12-14 03:24:15 +04:00
|
|
|
nvpair_t *elem = NULL;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
mutex_enter(&spa->spa_props_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
while ((elem = nvlist_next_nvpair(nvp, elem))) {
|
2012-12-14 03:24:15 +04:00
|
|
|
uint64_t intval;
|
|
|
|
char *strval, *fname;
|
|
|
|
zpool_prop_t prop;
|
|
|
|
const char *propname;
|
|
|
|
zprop_type_t proptype;
|
2013-10-08 21:13:05 +04:00
|
|
|
spa_feature_t fid;
|
2012-12-14 03:24:15 +04:00
|
|
|
|
2018-01-19 20:22:37 +03:00
|
|
|
switch (prop = zpool_name_to_prop(nvpair_name(elem))) {
|
|
|
|
case ZPOOL_PROP_INVAL:
|
2012-12-14 03:24:15 +04:00
|
|
|
/*
|
|
|
|
* We checked this earlier in spa_prop_validate().
|
|
|
|
*/
|
|
|
|
ASSERT(zpool_prop_feature(nvpair_name(elem)));
|
|
|
|
|
|
|
|
fname = strchr(nvpair_name(elem), '@') + 1;
|
2013-10-08 21:13:05 +04:00
|
|
|
VERIFY0(zfeature_lookup_name(fname, &fid));
|
2012-12-14 03:24:15 +04:00
|
|
|
|
2013-10-08 21:13:05 +04:00
|
|
|
spa_feature_enable(spa, fid, tx);
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal(spa, "set", tx,
|
|
|
|
"%s=enabled", nvpair_name(elem));
|
2012-12-14 03:24:15 +04:00
|
|
|
break;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
case ZPOOL_PROP_VERSION:
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
intval = fnvpair_value_uint64(elem);
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2017-01-03 20:31:18 +03:00
|
|
|
* The version is synced separately before other
|
2012-12-14 03:24:15 +04:00
|
|
|
* properties and should be correct by now.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2012-12-14 03:24:15 +04:00
|
|
|
ASSERT3U(spa_version(spa), >=, intval);
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
|
|
|
|
case ZPOOL_PROP_ALTROOT:
|
|
|
|
/*
|
|
|
|
* 'altroot' is a non-persistent property. It should
|
|
|
|
* have been set temporarily at creation or import time.
|
|
|
|
*/
|
|
|
|
ASSERT(spa->spa_root != NULL);
|
|
|
|
break;
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
case ZPOOL_PROP_READONLY:
|
2008-11-20 23:01:55 +03:00
|
|
|
case ZPOOL_PROP_CACHEFILE:
|
|
|
|
/*
|
2019-09-03 03:56:41 +03:00
|
|
|
* 'readonly' and 'cachefile' are also non-persistent
|
2010-08-27 01:24:34 +04:00
|
|
|
* properties.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
break;
|
2011-11-15 23:01:27 +04:00
|
|
|
case ZPOOL_PROP_COMMENT:
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
strval = fnvpair_value_string(elem);
|
2011-11-15 23:01:27 +04:00
|
|
|
if (spa->spa_comment != NULL)
|
|
|
|
spa_strfree(spa->spa_comment);
|
|
|
|
spa->spa_comment = spa_strdup(strval);
|
|
|
|
/*
|
|
|
|
* We need to dirty the configuration on all the vdevs
|
2021-06-25 00:30:02 +03:00
|
|
|
* so that their labels get updated. We also need to
|
|
|
|
* update the cache file to keep it in sync with the
|
|
|
|
* MOS version. It's unnecessary to do this for pool
|
|
|
|
* creation since the vdev's configuration has already
|
|
|
|
* been dirtied.
|
2011-11-15 23:01:27 +04:00
|
|
|
*/
|
2021-06-25 00:30:02 +03:00
|
|
|
if (tx->tx_txg != TXG_INITIAL) {
|
2011-11-15 23:01:27 +04:00
|
|
|
vdev_config_dirty(spa->spa_root_vdev);
|
2021-06-25 00:30:02 +03:00
|
|
|
spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
|
|
|
|
}
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal(spa, "set", tx,
|
|
|
|
"%s=%s", nvpair_name(elem), strval);
|
2011-11-15 23:01:27 +04:00
|
|
|
break;
|
2021-02-18 08:30:45 +03:00
|
|
|
case ZPOOL_PROP_COMPATIBILITY:
|
|
|
|
strval = fnvpair_value_string(elem);
|
|
|
|
if (spa->spa_compatibility != NULL)
|
|
|
|
spa_strfree(spa->spa_compatibility);
|
|
|
|
spa->spa_compatibility = spa_strdup(strval);
|
|
|
|
/*
|
|
|
|
* Dirty the configuration on vdevs as above.
|
|
|
|
*/
|
2021-06-25 00:30:02 +03:00
|
|
|
if (tx->tx_txg != TXG_INITIAL) {
|
2021-02-18 08:30:45 +03:00
|
|
|
vdev_config_dirty(spa->spa_root_vdev);
|
2021-06-25 00:30:02 +03:00
|
|
|
spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
|
|
|
|
}
|
|
|
|
|
2021-02-18 08:30:45 +03:00
|
|
|
spa_history_log_internal(spa, "set", tx,
|
|
|
|
"%s=%s", nvpair_name(elem), strval);
|
|
|
|
break;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
default:
|
|
|
|
/*
|
|
|
|
* Set pool property values in the poolprops mos object.
|
|
|
|
*/
|
|
|
|
if (spa->spa_pool_props_object == 0) {
|
2012-12-14 03:24:15 +04:00
|
|
|
spa->spa_pool_props_object =
|
|
|
|
zap_create_link(mos, DMU_OT_POOL_PROPS,
|
2008-11-20 23:01:55 +03:00
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_PROPS,
|
2012-12-14 03:24:15 +04:00
|
|
|
tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* normalize the property name */
|
|
|
|
propname = zpool_prop_to_name(prop);
|
|
|
|
proptype = zpool_prop_get_type(prop);
|
|
|
|
|
|
|
|
if (nvpair_type(elem) == DATA_TYPE_STRING) {
|
|
|
|
ASSERT(proptype == PROP_TYPE_STRING);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
strval = fnvpair_value_string(elem);
|
|
|
|
VERIFY0(zap_update(mos,
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_pool_props_object, propname,
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
1, strlen(strval) + 1, strval, tx));
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal(spa, "set", tx,
|
|
|
|
"%s=%s", nvpair_name(elem), strval);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else if (nvpair_type(elem) == DATA_TYPE_UINT64) {
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
intval = fnvpair_value_uint64(elem);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (proptype == PROP_TYPE_INDEX) {
|
|
|
|
const char *unused;
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
VERIFY0(zpool_prop_index_to_string(
|
|
|
|
prop, intval, &unused));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
VERIFY0(zap_update(mos,
|
2008-11-20 23:01:55 +03:00
|
|
|
spa->spa_pool_props_object, propname,
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
8, 1, &intval, tx));
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal(spa, "set", tx,
|
2019-09-12 23:28:26 +03:00
|
|
|
"%s=%lld", nvpair_name(elem),
|
|
|
|
(longlong_t)intval);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
ASSERT(0); /* not allowed */
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (prop) {
|
|
|
|
case ZPOOL_PROP_DELEGATION:
|
|
|
|
spa->spa_delegation = intval;
|
|
|
|
break;
|
|
|
|
case ZPOOL_PROP_BOOTFS:
|
|
|
|
spa->spa_bootfs = intval;
|
|
|
|
break;
|
|
|
|
case ZPOOL_PROP_FAILUREMODE:
|
|
|
|
spa->spa_failmode = intval;
|
|
|
|
break;
|
2019-03-29 19:13:20 +03:00
|
|
|
case ZPOOL_PROP_AUTOTRIM:
|
|
|
|
spa->spa_autotrim = intval;
|
|
|
|
spa_async_request(spa,
|
|
|
|
SPA_ASYNC_AUTOTRIM_RESTART);
|
|
|
|
break;
|
2009-07-03 02:44:48 +04:00
|
|
|
case ZPOOL_PROP_AUTOEXPAND:
|
|
|
|
spa->spa_autoexpand = intval;
|
2010-05-29 00:45:14 +04:00
|
|
|
if (tx->tx_txg != TXG_INITIAL)
|
|
|
|
spa_async_request(spa,
|
|
|
|
SPA_ASYNC_AUTOEXPAND);
|
|
|
|
break;
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
case ZPOOL_PROP_MULTIHOST:
|
|
|
|
spa->spa_multihost = intval;
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
mutex_exit(&spa->spa_props_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Perform one-time upgrade on-disk changes. spa_version() does not
|
|
|
|
* reflect the new version this txg, so there must be no changes this
|
|
|
|
* txg to anything that the upgrade code depends on after it executes.
|
|
|
|
* Therefore this must be called after dsl_pool_sync() does the sync
|
|
|
|
* tasks.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_sync_upgrades(spa_t *spa, dmu_tx_t *tx)
|
|
|
|
{
|
2019-01-18 20:50:16 +03:00
|
|
|
if (spa_sync_pass(spa) != 1)
|
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2019-01-18 20:50:16 +03:00
|
|
|
dsl_pool_t *dp = spa->spa_dsl_pool;
|
2013-09-04 16:00:57 +04:00
|
|
|
rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (spa->spa_ubsync.ub_version < SPA_VERSION_ORIGIN &&
|
|
|
|
spa->spa_uberblock.ub_version >= SPA_VERSION_ORIGIN) {
|
|
|
|
dsl_pool_create_origin(dp, tx);
|
|
|
|
|
|
|
|
/* Keeping the origin open increases spa_minref */
|
|
|
|
spa->spa_minref += 3;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa->spa_ubsync.ub_version < SPA_VERSION_NEXT_CLONES &&
|
|
|
|
spa->spa_uberblock.ub_version >= SPA_VERSION_NEXT_CLONES) {
|
|
|
|
dsl_pool_upgrade_clones(dp, tx);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa->spa_ubsync.ub_version < SPA_VERSION_DIR_CLONES &&
|
|
|
|
spa->spa_uberblock.ub_version >= SPA_VERSION_DIR_CLONES) {
|
|
|
|
dsl_pool_upgrade_dir_clones(dp, tx);
|
|
|
|
|
|
|
|
/* Keeping the freedir open increases spa_minref */
|
|
|
|
spa->spa_minref += 3;
|
|
|
|
}
|
2012-12-14 03:24:15 +04:00
|
|
|
|
|
|
|
if (spa->spa_ubsync.ub_version < SPA_VERSION_FEATURES &&
|
|
|
|
spa->spa_uberblock.ub_version >= SPA_VERSION_FEATURES) {
|
|
|
|
spa_feature_create_zap_objects(spa, tx);
|
|
|
|
}
|
2014-10-18 19:58:11 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* LZ4_COMPRESS feature's behaviour was changed to activate_on_enable
|
|
|
|
* when possibility to use lz4 compression for metadata was added
|
|
|
|
* Old pools that have this feature enabled must be upgraded to have
|
|
|
|
* this feature active
|
|
|
|
*/
|
|
|
|
if (spa->spa_uberblock.ub_version >= SPA_VERSION_FEATURES) {
|
|
|
|
boolean_t lz4_en = spa_feature_is_enabled(spa,
|
|
|
|
SPA_FEATURE_LZ4_COMPRESS);
|
|
|
|
boolean_t lz4_ac = spa_feature_is_active(spa,
|
|
|
|
SPA_FEATURE_LZ4_COMPRESS);
|
|
|
|
|
|
|
|
if (lz4_en && !lz4_ac)
|
|
|
|
spa_feature_incr(spa, SPA_FEATURE_LZ4_COMPRESS, tx);
|
|
|
|
}
|
2016-06-16 01:47:05 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we haven't written the salt, do so now. Note that the
|
|
|
|
* feature may not be activated yet, but that's fine since
|
|
|
|
* the presence of this ZAP entry is backwards compatible.
|
|
|
|
*/
|
|
|
|
if (zap_contains(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
DMU_POOL_CHECKSUM_SALT) == ENOENT) {
|
|
|
|
VERIFY0(zap_add(spa->spa_meta_objset,
|
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CHECKSUM_SALT, 1,
|
|
|
|
sizeof (spa->spa_cksum_salt.zcs_bytes),
|
|
|
|
spa->spa_cksum_salt.zcs_bytes, tx));
|
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
rrw_exit(&dp->dp_config_rwlock, FTAG);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
static void
|
|
|
|
vdev_indirect_state_sync_verify(vdev_t *vd)
|
|
|
|
{
|
2019-12-05 23:37:00 +03:00
|
|
|
vdev_indirect_mapping_t *vim __maybe_unused = vd->vdev_indirect_mapping;
|
|
|
|
vdev_indirect_births_t *vib __maybe_unused = vd->vdev_indirect_births;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
if (vd->vdev_ops == &vdev_indirect_ops) {
|
|
|
|
ASSERT(vim != NULL);
|
|
|
|
ASSERT(vib != NULL);
|
|
|
|
}
|
|
|
|
|
2018-10-10 01:42:42 +03:00
|
|
|
uint64_t obsolete_sm_object = 0;
|
|
|
|
ASSERT0(vdev_obsolete_sm_object(vd, &obsolete_sm_object));
|
|
|
|
if (obsolete_sm_object != 0) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
ASSERT(vd->vdev_obsolete_sm != NULL);
|
|
|
|
ASSERT(vd->vdev_removing ||
|
|
|
|
vd->vdev_ops == &vdev_indirect_ops);
|
|
|
|
ASSERT(vdev_indirect_mapping_num_entries(vim) > 0);
|
|
|
|
ASSERT(vdev_indirect_mapping_bytes_mapped(vim) > 0);
|
2018-10-10 01:42:42 +03:00
|
|
|
ASSERT3U(obsolete_sm_object, ==,
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
space_map_object(vd->vdev_obsolete_sm));
|
|
|
|
ASSERT3U(vdev_indirect_mapping_bytes_mapped(vim), >=,
|
|
|
|
space_map_allocated(vd->vdev_obsolete_sm));
|
|
|
|
}
|
|
|
|
ASSERT(vd->vdev_obsolete_segments != NULL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since frees / remaps to an indirect vdev can only
|
|
|
|
* happen in syncing context, the obsolete segments
|
|
|
|
* tree must be empty when we start syncing.
|
|
|
|
*/
|
|
|
|
ASSERT0(range_tree_space(vd->vdev_obsolete_segments));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2019-01-18 20:50:16 +03:00
|
|
|
* Set the top-level vdev's max queue depth. Evaluate each top-level's
|
|
|
|
* async write queue depth in case it changed. The max queue depth will
|
|
|
|
* not change in the middle of syncing out this txg.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2019-01-18 20:50:16 +03:00
|
|
|
static void
|
|
|
|
spa_sync_adjust_vdev_max_queue_depth(spa_t *spa)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2019-01-18 20:50:16 +03:00
|
|
|
ASSERT(spa_writeable(spa));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2016-10-14 03:59:18 +03:00
|
|
|
uint32_t max_queue_depth = zfs_vdev_async_write_max_active *
|
|
|
|
zfs_vdev_queue_depth_pct / 100;
|
2019-01-18 20:50:16 +03:00
|
|
|
metaslab_class_t *normal = spa_normal_class(spa);
|
|
|
|
metaslab_class_t *special = spa_special_class(spa);
|
|
|
|
metaslab_class_t *dedup = spa_dedup_class(spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========
We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.
In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.
Performance Results
===================
Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.
Future Work
===========
Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized. Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Porting Notes:
* Fix reservation test failures by increasing tolerance.
OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-02-12 23:56:06 +03:00
|
|
|
uint64_t slots_per_allocator = 0;
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < rvd->vdev_children; c++) {
|
2016-10-14 03:59:18 +03:00
|
|
|
vdev_t *tvd = rvd->vdev_child[c];
|
2018-09-06 04:33:36 +03:00
|
|
|
|
2019-01-18 20:50:16 +03:00
|
|
|
metaslab_group_t *mg = tvd->vdev_mg;
|
2018-09-06 04:33:36 +03:00
|
|
|
if (mg == NULL || !metaslab_group_initialized(mg))
|
|
|
|
continue;
|
2016-10-14 03:59:18 +03:00
|
|
|
|
2019-01-18 20:50:16 +03:00
|
|
|
metaslab_class_t *mc = mg->mg_class;
|
2018-09-06 04:33:36 +03:00
|
|
|
if (mc != normal && mc != special && mc != dedup)
|
2016-10-14 03:59:18 +03:00
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* It is safe to do a lock-free check here because only async
|
|
|
|
* allocations look at mg_max_alloc_queue_depth, and async
|
|
|
|
* allocations all happen from spa_sync().
|
|
|
|
*/
|
2020-04-22 20:26:56 +03:00
|
|
|
for (int i = 0; i < mg->mg_allocators; i++) {
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT0(zfs_refcount_count(
|
2020-04-22 20:26:56 +03:00
|
|
|
&(mg->mg_allocator[i].mga_alloc_queue_depth)));
|
|
|
|
}
|
2016-10-14 03:59:18 +03:00
|
|
|
mg->mg_max_alloc_queue_depth = max_queue_depth;
|
OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========
We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.
In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.
Performance Results
===================
Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.
Future Work
===========
Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized. Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Porting Notes:
* Fix reservation test failures by increasing tolerance.
OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-02-12 23:56:06 +03:00
|
|
|
|
2020-04-22 20:26:56 +03:00
|
|
|
for (int i = 0; i < mg->mg_allocators; i++) {
|
|
|
|
mg->mg_allocator[i].mga_cur_max_alloc_queue_depth =
|
OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========
We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.
In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.
Performance Results
===================
Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.
Future Work
===========
Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized. Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Porting Notes:
* Fix reservation test failures by increasing tolerance.
OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-02-12 23:56:06 +03:00
|
|
|
zfs_vdev_def_queue_depth;
|
|
|
|
}
|
|
|
|
slots_per_allocator += zfs_vdev_def_queue_depth;
|
2016-10-14 03:59:18 +03:00
|
|
|
}
|
2018-09-06 04:33:36 +03:00
|
|
|
|
OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========
We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.
In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.
Performance Results
===================
Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.
Future Work
===========
Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized. Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Porting Notes:
* Fix reservation test failures by increasing tolerance.
OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-02-12 23:56:06 +03:00
|
|
|
for (int i = 0; i < spa->spa_alloc_count; i++) {
|
2020-12-15 21:55:44 +03:00
|
|
|
ASSERT0(zfs_refcount_count(&normal->mc_allocator[i].
|
|
|
|
mca_alloc_slots));
|
|
|
|
ASSERT0(zfs_refcount_count(&special->mc_allocator[i].
|
|
|
|
mca_alloc_slots));
|
|
|
|
ASSERT0(zfs_refcount_count(&dedup->mc_allocator[i].
|
|
|
|
mca_alloc_slots));
|
|
|
|
normal->mc_allocator[i].mca_alloc_max_slots =
|
|
|
|
slots_per_allocator;
|
|
|
|
special->mc_allocator[i].mca_alloc_max_slots =
|
|
|
|
slots_per_allocator;
|
|
|
|
dedup->mc_allocator[i].mca_alloc_max_slots =
|
|
|
|
slots_per_allocator;
|
2018-09-06 04:33:36 +03:00
|
|
|
}
|
|
|
|
normal->mc_alloc_throttle_enabled = zio_dva_throttle_enabled;
|
|
|
|
special->mc_alloc_throttle_enabled = zio_dva_throttle_enabled;
|
|
|
|
dedup->mc_alloc_throttle_enabled = zio_dva_throttle_enabled;
|
2019-01-18 20:50:16 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_sync_condense_indirect(spa_t *spa, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
ASSERT(spa_writeable(spa));
|
2016-10-14 03:59:18 +03:00
|
|
|
|
2019-01-18 20:50:16 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
for (int c = 0; c < rvd->vdev_children; c++) {
|
|
|
|
vdev_t *vd = rvd->vdev_child[c];
|
|
|
|
vdev_indirect_state_sync_verify(vd);
|
|
|
|
|
|
|
|
if (vdev_indirect_should_condense(vd)) {
|
|
|
|
spa_condense_indirect_start_sync(vd, tx);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2019-01-18 20:50:16 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
spa_sync_iterate_to_convergence(spa_t *spa, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
objset_t *mos = spa->spa_meta_objset;
|
|
|
|
dsl_pool_t *dp = spa->spa_dsl_pool;
|
|
|
|
uint64_t txg = tx->tx_txg;
|
|
|
|
bplist_t *free_bpl = &spa->spa_free_bplist[txg & TXG_MASK];
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
do {
|
2010-05-29 00:45:14 +04:00
|
|
|
int pass = ++spa->spa_sync_pass;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
spa_sync_config_object(spa, tx);
|
|
|
|
spa_sync_aux_dev(spa, &spa->spa_spares, tx,
|
|
|
|
ZPOOL_CONFIG_SPARES, DMU_POOL_SPARES);
|
|
|
|
spa_sync_aux_dev(spa, &spa->spa_l2cache, tx,
|
|
|
|
ZPOOL_CONFIG_L2CACHE, DMU_POOL_L2CACHE);
|
|
|
|
spa_errlog_sync(spa, txg);
|
|
|
|
dsl_pool_sync(dp, txg);
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
if (pass < zfs_sync_pass_deferred_free ||
|
|
|
|
spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP)) {
|
|
|
|
/*
|
|
|
|
* If the log space map feature is active we don't
|
|
|
|
* care about deferred frees and the deferred bpobj
|
|
|
|
* as the log space map should effectively have the
|
|
|
|
* same results (i.e. appending only to one object).
|
|
|
|
*/
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
spa_sync_frees(spa, free_bpl, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
2015-07-11 03:19:41 +03:00
|
|
|
/*
|
|
|
|
* We can not defer frees in pass 1, because
|
|
|
|
* we sync the deferred frees later in pass 1.
|
|
|
|
*/
|
|
|
|
ASSERT3U(pass, >, 1);
|
2019-07-26 20:54:14 +03:00
|
|
|
bplist_iterate(free_bpl, bpobj_enqueue_alloc_cb,
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
&spa->spa_deferred_bpobj, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ddt_sync(spa, txg);
|
|
|
|
dsl_scan_sync(dp, tx);
|
2019-01-18 20:50:16 +03:00
|
|
|
svr_sync(spa, tx);
|
|
|
|
spa_sync_upgrades(spa, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
spa_flush_metaslabs(spa, tx);
|
|
|
|
|
2019-01-18 20:50:16 +03:00
|
|
|
vdev_t *vd = NULL;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
while ((vd = txg_list_remove(&spa->spa_vdev_txg_list, txg))
|
|
|
|
!= NULL)
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_sync(vd, txg);
|
|
|
|
|
2019-01-18 20:50:16 +03:00
|
|
|
/*
|
|
|
|
* Note: We need to check if the MOS is dirty because we could
|
|
|
|
* have marked the MOS dirty without updating the uberblock
|
|
|
|
* (e.g. if we have sync tasks but no dirty user data). We need
|
|
|
|
* to check the uberblock's rootbp because it is updated if we
|
|
|
|
* have synced out dirty data (though in this case the MOS will
|
|
|
|
* most likely also be dirty due to second order effects, we
|
|
|
|
* don't want to rely on that here).
|
|
|
|
*/
|
|
|
|
if (pass == 1 &&
|
|
|
|
spa->spa_uberblock.ub_rootbp.blk_birth < txg &&
|
|
|
|
!dmu_objset_is_dirty(mos, txg)) {
|
2015-07-11 03:19:41 +03:00
|
|
|
/*
|
2019-01-18 20:50:16 +03:00
|
|
|
* Nothing changed on the first pass, therefore this
|
|
|
|
* TXG is a no-op. Avoid syncing deferred frees, so
|
|
|
|
* that we can keep this TXG as a no-op.
|
2015-07-11 03:19:41 +03:00
|
|
|
*/
|
2019-01-18 20:50:16 +03:00
|
|
|
ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg));
|
|
|
|
ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
|
|
|
|
ASSERT(txg_list_empty(&dp->dp_sync_tasks, txg));
|
|
|
|
ASSERT(txg_list_empty(&dp->dp_early_sync_tasks, txg));
|
|
|
|
break;
|
2015-07-11 03:19:41 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-01-18 20:50:16 +03:00
|
|
|
spa_sync_deferred_frees(spa, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
} while (dmu_objset_is_dirty(mos, txg));
|
2019-01-18 20:50:16 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-01-18 20:50:16 +03:00
|
|
|
/*
|
|
|
|
* Rewrite the vdev configuration (which includes the uberblock) to
|
|
|
|
* commit the transaction group.
|
|
|
|
*
|
|
|
|
* If there are no dirty vdevs, we sync the uberblock to a few random
|
|
|
|
* top-level vdevs that are known to be visible in the config cache
|
|
|
|
* (see spa_vdev_add() for a complete description). If there *are* dirty
|
|
|
|
* vdevs, sync the uberblock to all vdevs.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
spa_sync_rewrite_vdev_config(spa_t *spa, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
uint64_t txg = tx->tx_txg;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
for (;;) {
|
2019-01-18 20:50:16 +03:00
|
|
|
int error = 0;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* We hold SCL_STATE to prevent vdev open/close/etc.
|
|
|
|
* while we're attempting to write the vdev labels.
|
|
|
|
*/
|
|
|
|
spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
|
|
|
|
|
|
|
|
if (list_is_empty(&spa->spa_config_dirty_list)) {
|
2016-12-17 01:11:29 +03:00
|
|
|
vdev_t *svd[SPA_SYNC_MIN_VDEVS] = { NULL };
|
2008-12-03 23:09:06 +03:00
|
|
|
int svdcount = 0;
|
|
|
|
int children = rvd->vdev_children;
|
2021-06-23 02:35:23 +03:00
|
|
|
int c0 = random_in_range(children);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int c = 0; c < children; c++) {
|
2019-01-18 20:50:16 +03:00
|
|
|
vdev_t *vd =
|
|
|
|
rvd->vdev_child[(c0 + c) % children];
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
/* Stop when revisiting the first vdev */
|
|
|
|
if (c > 0 && svd[0] == vd)
|
|
|
|
break;
|
|
|
|
|
2019-01-18 20:50:16 +03:00
|
|
|
if (vd->vdev_ms_array == 0 ||
|
|
|
|
vd->vdev_islog ||
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
!vdev_is_concrete(vd))
|
2008-12-03 23:09:06 +03:00
|
|
|
continue;
|
2016-12-17 01:11:29 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
svd[svdcount++] = vd;
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
if (svdcount == SPA_SYNC_MIN_VDEVS)
|
2008-12-03 23:09:06 +03:00
|
|
|
break;
|
|
|
|
}
|
2016-01-27 04:27:46 +03:00
|
|
|
error = vdev_config_sync(svd, svdcount, txg);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
error = vdev_config_sync(rvd->vdev_child,
|
2016-01-27 04:27:46 +03:00
|
|
|
rvd->vdev_children, txg);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2012-12-15 00:38:04 +04:00
|
|
|
if (error == 0)
|
|
|
|
spa->spa_last_synced_guid = rvd->vdev_guid;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
|
|
|
|
|
|
|
if (error == 0)
|
|
|
|
break;
|
2018-03-15 20:56:55 +03:00
|
|
|
zio_suspend(spa, NULL, ZIO_SUSPEND_IOERR);
|
2008-12-03 23:09:06 +03:00
|
|
|
zio_resume_wait(spa);
|
|
|
|
}
|
2019-01-18 20:50:16 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sync the specified transaction group. New blocks may be dirtied as
|
|
|
|
* part of the process, so we iterate until it converges.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_sync(spa_t *spa, uint64_t txg)
|
|
|
|
{
|
|
|
|
vdev_t *vd = NULL;
|
|
|
|
|
|
|
|
VERIFY(spa_writeable(spa));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait for i/os issued in open context that need to complete
|
|
|
|
* before this txg syncs.
|
|
|
|
*/
|
|
|
|
(void) zio_wait(spa->spa_txg_zio[txg & TXG_MASK]);
|
|
|
|
spa->spa_txg_zio[txg & TXG_MASK] = zio_root(spa, NULL, NULL,
|
|
|
|
ZIO_FLAG_CANFAIL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Lock out configuration changes.
|
|
|
|
*/
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
|
|
|
|
spa->spa_syncing_txg = txg;
|
|
|
|
spa->spa_sync_pass = 0;
|
|
|
|
|
|
|
|
for (int i = 0; i < spa->spa_alloc_count; i++) {
|
2021-07-21 15:40:36 +03:00
|
|
|
mutex_enter(&spa->spa_allocs[i].spaa_lock);
|
|
|
|
VERIFY0(avl_numnodes(&spa->spa_allocs[i].spaa_tree));
|
|
|
|
mutex_exit(&spa->spa_allocs[i].spaa_lock);
|
2019-01-18 20:50:16 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If there are any pending vdev state changes, convert them
|
|
|
|
* into config changes that go out with this transaction group.
|
|
|
|
*/
|
|
|
|
spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
|
|
|
|
while (list_head(&spa->spa_state_dirty_list) != NULL) {
|
|
|
|
/*
|
|
|
|
* We need the write lock here because, for aux vdevs,
|
|
|
|
* calling vdev_config_dirty() modifies sav_config.
|
|
|
|
* This is ugly and will become unnecessary when we
|
|
|
|
* eliminate the aux vdev wart by integrating all vdevs
|
|
|
|
* into the root vdev tree.
|
|
|
|
*/
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_WRITER);
|
|
|
|
while ((vd = list_head(&spa->spa_state_dirty_list)) != NULL) {
|
|
|
|
vdev_state_clean(vd);
|
|
|
|
vdev_config_dirty(vd);
|
|
|
|
}
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);
|
|
|
|
}
|
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
|
|
|
|
|
|
|
dsl_pool_t *dp = spa->spa_dsl_pool;
|
|
|
|
dmu_tx_t *tx = dmu_tx_create_assigned(dp, txg);
|
|
|
|
|
|
|
|
spa->spa_sync_starttime = gethrtime();
|
|
|
|
taskq_cancel_id(system_delay_taskq, spa->spa_deadman_tqid);
|
|
|
|
spa->spa_deadman_tqid = taskq_dispatch_delay(system_delay_taskq,
|
|
|
|
spa_deadman, spa, TQ_SLEEP, ddi_get_lbolt() +
|
|
|
|
NSEC_TO_TICK(spa->spa_deadman_synctime));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are upgrading to SPA_VERSION_RAIDZ_DEFLATE this txg,
|
|
|
|
* set spa_deflate if we have no raid-z vdevs.
|
|
|
|
*/
|
|
|
|
if (spa->spa_ubsync.ub_version < SPA_VERSION_RAIDZ_DEFLATE &&
|
|
|
|
spa->spa_uberblock.ub_version >= SPA_VERSION_RAIDZ_DEFLATE) {
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
|
|
|
int i;
|
|
|
|
for (i = 0; i < rvd->vdev_children; i++) {
|
|
|
|
vd = rvd->vdev_child[i];
|
|
|
|
if (vd->vdev_deflate_ratio != SPA_MINBLOCKSIZE)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (i == rvd->vdev_children) {
|
|
|
|
spa->spa_deflate = TRUE;
|
|
|
|
VERIFY0(zap_add(spa->spa_meta_objset,
|
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_DEFLATE,
|
|
|
|
sizeof (uint64_t), 1, &spa->spa_deflate, tx));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
spa_sync_adjust_vdev_max_queue_depth(spa);
|
|
|
|
|
|
|
|
spa_sync_condense_indirect(spa, tx);
|
|
|
|
|
|
|
|
spa_sync_iterate_to_convergence(spa, tx);
|
|
|
|
|
|
|
|
#ifdef ZFS_DEBUG
|
|
|
|
if (!list_is_empty(&spa->spa_config_dirty_list)) {
|
|
|
|
/*
|
|
|
|
* Make sure that the number of ZAPs for all the vdevs matches
|
|
|
|
* the number of ZAPs in the per-vdev ZAP list. This only gets
|
|
|
|
* called if the config is dirty; otherwise there may be
|
|
|
|
* outstanding AVZ operations that weren't completed in
|
|
|
|
* spa_sync_config_object.
|
|
|
|
*/
|
|
|
|
uint64_t all_vdev_zap_entry_count;
|
|
|
|
ASSERT0(zap_count(spa->spa_meta_objset,
|
|
|
|
spa->spa_all_vdev_zaps, &all_vdev_zap_entry_count));
|
|
|
|
ASSERT3U(vdev_count_verify_zaps(spa->spa_root_vdev), ==,
|
|
|
|
all_vdev_zap_entry_count);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
if (spa->spa_vdev_removal != NULL) {
|
|
|
|
ASSERT0(spa->spa_vdev_removal->svr_bytes_done[txg & TXG_MASK]);
|
|
|
|
}
|
|
|
|
|
|
|
|
spa_sync_rewrite_vdev_config(spa, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
2016-12-01 00:56:50 +03:00
|
|
|
taskq_cancel_id(system_delay_taskq, spa->spa_deadman_tqid);
|
2013-04-30 02:49:23 +04:00
|
|
|
spa->spa_deadman_tqid = 0;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Clear the dirty config list.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
while ((vd = list_head(&spa->spa_config_dirty_list)) != NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_config_clean(vd);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now that the new config has synced transactionally,
|
|
|
|
* let it become visible to the config cache.
|
|
|
|
*/
|
|
|
|
if (spa->spa_config_syncing != NULL) {
|
|
|
|
spa_config_set(spa, spa->spa_config_syncing);
|
|
|
|
spa->spa_config_txg = txg;
|
|
|
|
spa->spa_config_syncing = NULL;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_pool_sync_done(dp, txg);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========
We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.
In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.
Performance Results
===================
Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.
Future Work
===========
Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized. Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Porting Notes:
* Fix reservation test failures by increasing tolerance.
OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-02-12 23:56:06 +03:00
|
|
|
for (int i = 0; i < spa->spa_alloc_count; i++) {
|
2021-07-21 15:40:36 +03:00
|
|
|
mutex_enter(&spa->spa_allocs[i].spaa_lock);
|
|
|
|
VERIFY0(avl_numnodes(&spa->spa_allocs[i].spaa_tree));
|
|
|
|
mutex_exit(&spa->spa_allocs[i].spaa_lock);
|
OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========
We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.
In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.
Performance Results
===================
Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.
Future Work
===========
Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized. Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Porting Notes:
* Fix reservation test failures by increasing tolerance.
OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-02-12 23:56:06 +03:00
|
|
|
}
|
2016-10-14 03:59:18 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Update usable space statistics.
|
|
|
|
*/
|
OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2018-12-19 17:54:59 +03:00
|
|
|
while ((vd = txg_list_remove(&spa->spa_vdev_txg_list, TXG_CLEAN(txg)))
|
|
|
|
!= NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_sync_done(vd, txg);
|
Cap metaslab memory usage
On systems with large amounts of storage and high fragmentation, a huge
amount of space can be used by storing metaslab range trees. Since
metaslabs are only unloaded during a txg sync, and only if they have
been inactive for 8 txgs, it is possible to get into a state where all
of the system's memory is consumed by range trees and metaslabs, and
txgs cannot sync. While ZFS knows how to evict ARC data when needed,
it has no such mechanism for range tree data. This can result in boot
hangs for some system configurations.
First, we add the ability to unload metaslabs outside of syncing
context. Second, we store a multilist of all loaded metaslabs, sorted
by their selection txg, so we can quickly identify the oldest
metaslabs. We use a multilist to reduce lock contention during heavy
write workloads. Finally, we add logic that will unload a metaslab
when we're loading a new metaslab, if we're using more than a certain
fraction of the available memory on range trees.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9128
2019-08-16 18:08:21 +03:00
|
|
|
|
|
|
|
metaslab_class_evict_old(spa->spa_normal_class, txg);
|
|
|
|
metaslab_class_evict_old(spa->spa_log_class, txg);
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
spa_sync_close_syncing_log_sm(spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_update_dspace(spa);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* It had better be the case that we didn't dirty anything
|
|
|
|
* since vdev_config_sync().
|
|
|
|
*/
|
|
|
|
ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg));
|
|
|
|
ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
|
|
|
|
ASSERT(txg_list_empty(&spa->spa_vdev_txg_list, txg));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
while (zfs_pause_spa_sync)
|
|
|
|
delay(1);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa->spa_sync_pass = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-11-06 06:43:56 +03:00
|
|
|
/*
|
|
|
|
* Update the last synced uberblock here. We want to do this at
|
|
|
|
* the end of spa_sync() so that consumers of spa_last_synced_txg()
|
|
|
|
* will be guaranteed that all the processing associated with
|
|
|
|
* that txg has been completed.
|
|
|
|
*/
|
|
|
|
spa->spa_ubsync = spa->spa_uberblock;
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_handle_ignored_writes(spa);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* If any async tasks have been requested, kick them off.
|
|
|
|
*/
|
|
|
|
spa_async_dispatch(spa);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sync all pools. We don't want to hold the namespace lock across these
|
|
|
|
* operations, so we take a reference on the spa_t and drop the lock during the
|
|
|
|
* sync.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_sync_allpools(void)
|
|
|
|
{
|
|
|
|
spa_t *spa = NULL;
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
while ((spa = spa_next(spa)) != NULL) {
|
2010-08-27 01:24:34 +04:00
|
|
|
if (spa_state(spa) != POOL_STATE_ACTIVE ||
|
|
|
|
!spa_writeable(spa) || spa_suspended(spa))
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
|
|
|
spa_open_ref(spa, FTAG);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
}
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* Miscellaneous routines
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove all pools in the system.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_evict_all(void)
|
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove all cached state. All pools should be closed now,
|
|
|
|
* so every spa in the AVL tree should be unreferenced.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
while ((spa = spa_next(NULL)) != NULL) {
|
|
|
|
/*
|
|
|
|
* Stop async tasks. The async thread may need to detach
|
|
|
|
* a device that's been replaced, which requires grabbing
|
|
|
|
* spa_namespace_lock, so we must drop it here.
|
|
|
|
*/
|
|
|
|
spa_open_ref(spa, FTAG);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
spa_async_suspend(spa);
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
|
|
|
|
spa_unload(spa);
|
|
|
|
spa_deactivate(spa);
|
|
|
|
}
|
|
|
|
spa_remove(spa);
|
|
|
|
}
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
vdev_t *
|
2009-07-03 02:44:48 +04:00
|
|
|
spa_lookup_by_guid(spa_t *spa, uint64_t guid, boolean_t aux)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
vdev_t *vd;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if ((vd = vdev_lookup_by_guid(spa->spa_root_vdev, guid)) != NULL)
|
|
|
|
return (vd);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (aux) {
|
2008-12-03 23:09:06 +03:00
|
|
|
for (i = 0; i < spa->spa_l2cache.sav_count; i++) {
|
|
|
|
vd = spa->spa_l2cache.sav_vdevs[i];
|
2009-07-03 02:44:48 +04:00
|
|
|
if (vd->vdev_guid == guid)
|
|
|
|
return (vd);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < spa->spa_spares.sav_count; i++) {
|
|
|
|
vd = spa->spa_spares.sav_vdevs[i];
|
2008-12-03 23:09:06 +03:00
|
|
|
if (vd->vdev_guid == guid)
|
|
|
|
return (vd);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return (NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_upgrade(spa_t *spa, uint64_t version)
|
|
|
|
{
|
2010-08-27 01:24:34 +04:00
|
|
|
ASSERT(spa_writeable(spa));
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This should only be called for a non-faulted pool, and since a
|
|
|
|
* future version would result in an unopenable pool, this shouldn't be
|
|
|
|
* possible.
|
|
|
|
*/
|
2013-02-10 04:25:55 +04:00
|
|
|
ASSERT(SPA_VERSION_IS_SUPPORTED(spa->spa_uberblock.ub_version));
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT3U(version, >=, spa->spa_uberblock.ub_version);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
spa->spa_uberblock.ub_version = version;
|
|
|
|
vdev_config_dirty(spa->spa_root_vdev);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), 0);
|
|
|
|
}
|
|
|
|
|
2021-11-11 22:54:15 +03:00
|
|
|
static boolean_t
|
|
|
|
spa_has_aux_vdev(spa_t *spa, uint64_t guid, spa_aux_vdev_t *sav)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) spa;
|
2008-11-20 23:01:55 +03:00
|
|
|
int i;
|
2021-11-11 22:54:15 +03:00
|
|
|
uint64_t vdev_guid;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (i = 0; i < sav->sav_count; i++)
|
|
|
|
if (sav->sav_vdevs[i]->vdev_guid == guid)
|
|
|
|
return (B_TRUE);
|
|
|
|
|
|
|
|
for (i = 0; i < sav->sav_npending; i++) {
|
|
|
|
if (nvlist_lookup_uint64(sav->sav_pending[i], ZPOOL_CONFIG_GUID,
|
2021-11-11 22:54:15 +03:00
|
|
|
&vdev_guid) == 0 && vdev_guid == guid)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
2021-11-11 22:54:15 +03:00
|
|
|
boolean_t
|
|
|
|
spa_has_l2cache(spa_t *spa, uint64_t guid)
|
|
|
|
{
|
|
|
|
return (spa_has_aux_vdev(spa, guid, &spa->spa_l2cache));
|
|
|
|
}
|
|
|
|
|
|
|
|
boolean_t
|
|
|
|
spa_has_spare(spa_t *spa, uint64_t guid)
|
|
|
|
{
|
|
|
|
return (spa_has_aux_vdev(spa, guid, &spa->spa_spares));
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Check if a pool has an active shared spare device.
|
|
|
|
* Note: reference count of an active spare is 2, as a spare and as a replace
|
|
|
|
*/
|
|
|
|
static boolean_t
|
|
|
|
spa_has_active_shared_spare(spa_t *spa)
|
|
|
|
{
|
|
|
|
int i, refcnt;
|
|
|
|
uint64_t pool;
|
|
|
|
spa_aux_vdev_t *sav = &spa->spa_spares;
|
|
|
|
|
|
|
|
for (i = 0; i < sav->sav_count; i++) {
|
|
|
|
if (spa_spare_exists(sav->sav_vdevs[i]->vdev_guid, &pool,
|
|
|
|
&refcnt) && pool != 0ULL && pool == spa_guid(spa) &&
|
|
|
|
refcnt > 2)
|
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
uint64_t
|
|
|
|
spa_total_metaslabs(spa_t *spa)
|
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
|
|
|
uint64_t m = 0;
|
|
|
|
for (uint64_t c = 0; c < rvd->vdev_children; c++) {
|
|
|
|
vdev_t *vd = rvd->vdev_child[c];
|
|
|
|
if (!vdev_is_concrete(vd))
|
|
|
|
continue;
|
|
|
|
m += vd->vdev_ms_count;
|
|
|
|
}
|
|
|
|
return (m);
|
|
|
|
}
|
|
|
|
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
/*
|
|
|
|
* Notify any waiting threads that some activity has switched from being in-
|
|
|
|
* progress to not-in-progress so that the thread can wake up and determine
|
|
|
|
* whether it is finished waiting.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_notify_waiters(spa_t *spa)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Acquiring spa_activities_lock here prevents the cv_broadcast from
|
|
|
|
* happening between the waiting thread's check and cv_wait.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa->spa_activities_lock);
|
|
|
|
cv_broadcast(&spa->spa_activities_cv);
|
|
|
|
mutex_exit(&spa->spa_activities_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Notify any waiting threads that the pool is exporting, and then block until
|
|
|
|
* they are finished using the spa_t.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_wake_waiters(spa_t *spa)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa->spa_activities_lock);
|
|
|
|
spa->spa_waiters_cancel = B_TRUE;
|
|
|
|
cv_broadcast(&spa->spa_activities_cv);
|
|
|
|
while (spa->spa_waiters != 0)
|
|
|
|
cv_wait(&spa->spa_waiters_cv, &spa->spa_activities_lock);
|
|
|
|
spa->spa_waiters_cancel = B_FALSE;
|
|
|
|
mutex_exit(&spa->spa_activities_lock);
|
|
|
|
}
|
|
|
|
|
2020-03-05 02:07:11 +03:00
|
|
|
/* Whether the vdev or any of its descendants are being initialized/trimmed. */
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
static boolean_t
|
2020-03-05 02:07:11 +03:00
|
|
|
spa_vdev_activity_in_progress_impl(vdev_t *vd, zpool_wait_activity_t activity)
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
{
|
|
|
|
spa_t *spa = vd->vdev_spa;
|
|
|
|
|
|
|
|
ASSERT(spa_config_held(spa, SCL_CONFIG | SCL_STATE, RW_READER));
|
|
|
|
ASSERT(MUTEX_HELD(&spa->spa_activities_lock));
|
2020-03-05 02:07:11 +03:00
|
|
|
ASSERT(activity == ZPOOL_WAIT_INITIALIZE ||
|
|
|
|
activity == ZPOOL_WAIT_TRIM);
|
|
|
|
|
|
|
|
kmutex_t *lock = activity == ZPOOL_WAIT_INITIALIZE ?
|
|
|
|
&vd->vdev_initialize_lock : &vd->vdev_trim_lock;
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
|
|
|
|
mutex_exit(&spa->spa_activities_lock);
|
2020-03-05 02:07:11 +03:00
|
|
|
mutex_enter(lock);
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
mutex_enter(&spa->spa_activities_lock);
|
|
|
|
|
2020-03-05 02:07:11 +03:00
|
|
|
boolean_t in_progress = (activity == ZPOOL_WAIT_INITIALIZE) ?
|
|
|
|
(vd->vdev_initialize_state == VDEV_INITIALIZE_ACTIVE) :
|
|
|
|
(vd->vdev_trim_state == VDEV_TRIM_ACTIVE);
|
|
|
|
mutex_exit(lock);
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
|
2020-03-05 02:07:11 +03:00
|
|
|
if (in_progress)
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
return (B_TRUE);
|
|
|
|
|
|
|
|
for (int i = 0; i < vd->vdev_children; i++) {
|
2020-03-05 02:07:11 +03:00
|
|
|
if (spa_vdev_activity_in_progress_impl(vd->vdev_child[i],
|
|
|
|
activity))
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If use_guid is true, this checks whether the vdev specified by guid is
|
2020-03-05 02:07:11 +03:00
|
|
|
* being initialized/trimmed. Otherwise, it checks whether any vdev in the pool
|
|
|
|
* is being initialized/trimmed. The caller must hold the config lock and
|
|
|
|
* spa_activities_lock.
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
*/
|
|
|
|
static int
|
2020-03-05 02:07:11 +03:00
|
|
|
spa_vdev_activity_in_progress(spa_t *spa, boolean_t use_guid, uint64_t guid,
|
|
|
|
zpool_wait_activity_t activity, boolean_t *in_progress)
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
{
|
|
|
|
mutex_exit(&spa->spa_activities_lock);
|
|
|
|
spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);
|
|
|
|
mutex_enter(&spa->spa_activities_lock);
|
|
|
|
|
|
|
|
vdev_t *vd;
|
|
|
|
if (use_guid) {
|
|
|
|
vd = spa_lookup_by_guid(spa, guid, B_FALSE);
|
|
|
|
if (vd == NULL || !vd->vdev_ops->vdev_op_leaf) {
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
return (EINVAL);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
vd = spa->spa_root_vdev;
|
|
|
|
}
|
|
|
|
|
2020-03-05 02:07:11 +03:00
|
|
|
*in_progress = spa_vdev_activity_in_progress_impl(vd, activity);
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Locking for waiting threads
|
|
|
|
* ---------------------------
|
|
|
|
*
|
|
|
|
* Waiting threads need a way to check whether a given activity is in progress,
|
|
|
|
* and then, if it is, wait for it to complete. Each activity will have some
|
|
|
|
* in-memory representation of the relevant on-disk state which can be used to
|
|
|
|
* determine whether or not the activity is in progress. The in-memory state and
|
|
|
|
* the locking used to protect it will be different for each activity, and may
|
|
|
|
* not be suitable for use with a cvar (e.g., some state is protected by the
|
|
|
|
* config lock). To allow waiting threads to wait without any races, another
|
|
|
|
* lock, spa_activities_lock, is used.
|
|
|
|
*
|
|
|
|
* When the state is checked, both the activity-specific lock (if there is one)
|
|
|
|
* and spa_activities_lock are held. In some cases, the activity-specific lock
|
|
|
|
* is acquired explicitly (e.g. the config lock). In others, the locking is
|
|
|
|
* internal to some check (e.g. bpobj_is_empty). After checking, the waiting
|
|
|
|
* thread releases the activity-specific lock and, if the activity is in
|
|
|
|
* progress, then cv_waits using spa_activities_lock.
|
|
|
|
*
|
|
|
|
* The waiting thread is woken when another thread, one completing some
|
|
|
|
* activity, updates the state of the activity and then calls
|
|
|
|
* spa_notify_waiters, which will cv_broadcast. This 'completing' thread only
|
|
|
|
* needs to hold its activity-specific lock when updating the state, and this
|
|
|
|
* lock can (but doesn't have to) be dropped before calling spa_notify_waiters.
|
|
|
|
*
|
|
|
|
* Because spa_notify_waiters acquires spa_activities_lock before broadcasting,
|
|
|
|
* and because it is held when the waiting thread checks the state of the
|
|
|
|
* activity, it can never be the case that the completing thread both updates
|
|
|
|
* the activity state and cv_broadcasts in between the waiting thread's check
|
|
|
|
* and cv_wait. Thus, a waiting thread can never miss a wakeup.
|
|
|
|
*
|
|
|
|
* In order to prevent deadlock, when the waiting thread does its check, in some
|
|
|
|
* cases it will temporarily drop spa_activities_lock in order to acquire the
|
|
|
|
* activity-specific lock. The order in which spa_activities_lock and the
|
|
|
|
* activity specific lock are acquired in the waiting thread is determined by
|
|
|
|
* the order in which they are acquired in the completing thread; if the
|
|
|
|
* completing thread calls spa_notify_waiters with the activity-specific lock
|
|
|
|
* held, then the waiting thread must also acquire the activity-specific lock
|
|
|
|
* first.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_activity_in_progress(spa_t *spa, zpool_wait_activity_t activity,
|
|
|
|
boolean_t use_tag, uint64_t tag, boolean_t *in_progress)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&spa->spa_activities_lock));
|
|
|
|
|
|
|
|
switch (activity) {
|
|
|
|
case ZPOOL_WAIT_CKPT_DISCARD:
|
|
|
|
*in_progress =
|
|
|
|
(spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT) &&
|
|
|
|
zap_contains(spa_meta_objset(spa),
|
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_ZPOOL_CHECKPOINT) ==
|
|
|
|
ENOENT);
|
|
|
|
break;
|
|
|
|
case ZPOOL_WAIT_FREE:
|
|
|
|
*in_progress = ((spa_version(spa) >= SPA_VERSION_DEADLISTS &&
|
|
|
|
!bpobj_is_empty(&spa->spa_dsl_pool->dp_free_bpobj)) ||
|
|
|
|
spa_feature_is_active(spa, SPA_FEATURE_ASYNC_DESTROY) ||
|
|
|
|
spa_livelist_delete_check(spa));
|
|
|
|
break;
|
|
|
|
case ZPOOL_WAIT_INITIALIZE:
|
2020-03-05 02:07:11 +03:00
|
|
|
case ZPOOL_WAIT_TRIM:
|
|
|
|
error = spa_vdev_activity_in_progress(spa, use_tag, tag,
|
|
|
|
activity, in_progress);
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
break;
|
|
|
|
case ZPOOL_WAIT_REPLACE:
|
|
|
|
mutex_exit(&spa->spa_activities_lock);
|
|
|
|
spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);
|
|
|
|
mutex_enter(&spa->spa_activities_lock);
|
|
|
|
|
|
|
|
*in_progress = vdev_replace_in_progress(spa->spa_root_vdev);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
|
|
|
|
break;
|
|
|
|
case ZPOOL_WAIT_REMOVE:
|
|
|
|
*in_progress = (spa->spa_removing_phys.sr_state ==
|
|
|
|
DSS_SCANNING);
|
|
|
|
break;
|
|
|
|
case ZPOOL_WAIT_RESILVER:
|
2020-07-03 21:05:50 +03:00
|
|
|
if ((*in_progress = vdev_rebuild_active(spa->spa_root_vdev)))
|
|
|
|
break;
|
2022-02-15 19:58:59 +03:00
|
|
|
zfs_fallthrough;
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
case ZPOOL_WAIT_SCRUB:
|
|
|
|
{
|
|
|
|
boolean_t scanning, paused, is_scrub;
|
|
|
|
dsl_scan_t *scn = spa->spa_dsl_pool->dp_scan;
|
|
|
|
|
|
|
|
is_scrub = (scn->scn_phys.scn_func == POOL_SCAN_SCRUB);
|
|
|
|
scanning = (scn->scn_phys.scn_state == DSS_SCANNING);
|
|
|
|
paused = dsl_scan_is_paused_scrub(scn);
|
|
|
|
*in_progress = (scanning && !paused &&
|
|
|
|
is_scrub == (activity == ZPOOL_WAIT_SCRUB));
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default:
|
|
|
|
panic("unrecognized value for activity %d", activity);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_wait_common(const char *pool, zpool_wait_activity_t activity,
|
|
|
|
boolean_t use_tag, uint64_t tag, boolean_t *waited)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* The tag is used to distinguish between instances of an activity.
|
2020-03-05 02:07:11 +03:00
|
|
|
* 'initialize' and 'trim' are the only activities that we use this for.
|
|
|
|
* The other activities can only have a single instance in progress in a
|
|
|
|
* pool at one time, making the tag unnecessary.
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
*
|
|
|
|
* There can be multiple devices being replaced at once, but since they
|
|
|
|
* all finish once resilvering finishes, we don't bother keeping track
|
|
|
|
* of them individually, we just wait for them all to finish.
|
|
|
|
*/
|
2020-03-05 02:07:11 +03:00
|
|
|
if (use_tag && activity != ZPOOL_WAIT_INITIALIZE &&
|
|
|
|
activity != ZPOOL_WAIT_TRIM)
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
if (activity < 0 || activity >= ZPOOL_WAIT_NUM_ACTIVITIES)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
spa_t *spa;
|
|
|
|
int error = spa_open(pool, &spa, FTAG);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Increment the spa's waiter count so that we can call spa_close and
|
|
|
|
* still ensure that the spa_t doesn't get freed before this thread is
|
|
|
|
* finished with it when the pool is exported. We want to call spa_close
|
|
|
|
* before we start waiting because otherwise the additional ref would
|
|
|
|
* prevent the pool from being exported or destroyed throughout the
|
|
|
|
* potentially long wait.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa->spa_activities_lock);
|
|
|
|
spa->spa_waiters++;
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
*waited = B_FALSE;
|
|
|
|
for (;;) {
|
|
|
|
boolean_t in_progress;
|
|
|
|
error = spa_activity_in_progress(spa, activity, use_tag, tag,
|
|
|
|
&in_progress);
|
|
|
|
|
2019-09-16 20:46:02 +03:00
|
|
|
if (error || !in_progress || spa->spa_waiters_cancel)
|
Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-14 04:09:06 +03:00
|
|
|
break;
|
|
|
|
|
|
|
|
*waited = B_TRUE;
|
|
|
|
|
|
|
|
if (cv_wait_sig(&spa->spa_activities_cv,
|
|
|
|
&spa->spa_activities_lock) == 0) {
|
|
|
|
error = EINTR;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
spa->spa_waiters--;
|
|
|
|
cv_signal(&spa->spa_waiters_cv);
|
|
|
|
mutex_exit(&spa->spa_activities_lock);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait for a particular instance of the specified activity to complete, where
|
|
|
|
* the instance is identified by 'tag'
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
spa_wait_tag(const char *pool, zpool_wait_activity_t activity, uint64_t tag,
|
|
|
|
boolean_t *waited)
|
|
|
|
{
|
|
|
|
return (spa_wait_common(pool, activity, B_TRUE, tag, waited));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait for all instances of the specified activity complete
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
spa_wait(const char *pool, zpool_wait_activity_t activity, boolean_t *waited)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (spa_wait_common(pool, activity, B_FALSE, 0, waited));
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
sysevent_t *
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
|
|
|
|
{
|
|
|
|
sysevent_t *ev = NULL;
|
|
|
|
#ifdef _KERNEL
|
|
|
|
nvlist_t *resource;
|
|
|
|
|
|
|
|
resource = zfs_event_create(spa, vd, FM_SYSEVENT_CLASS, name, hist_nvl);
|
|
|
|
if (resource) {
|
|
|
|
ev = kmem_alloc(sizeof (sysevent_t), KM_SLEEP);
|
|
|
|
ev->resource = resource;
|
|
|
|
}
|
2021-12-12 18:06:44 +03:00
|
|
|
#else
|
|
|
|
(void) spa, (void) vd, (void) hist_nvl, (void) name;
|
2017-05-30 21:39:17 +03:00
|
|
|
#endif
|
|
|
|
return (ev);
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
void
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_post(sysevent_t *ev)
|
|
|
|
{
|
|
|
|
#ifdef _KERNEL
|
|
|
|
if (ev) {
|
|
|
|
zfs_zevent_post(ev->resource, NULL, zfs_zevent_post_cb);
|
|
|
|
kmem_free(ev, sizeof (*ev));
|
|
|
|
}
|
2021-12-12 18:06:44 +03:00
|
|
|
#else
|
|
|
|
(void) ev;
|
2017-05-30 21:39:17 +03:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2016-07-28 01:29:15 +03:00
|
|
|
* Post a zevent corresponding to the given sysevent. The 'name' must be one
|
|
|
|
* of the event definitions in sys/sysevent/eventdefs.h. The payload will be
|
2008-11-20 23:01:55 +03:00
|
|
|
* filled in from the spa and (optionally) the vdev. This doesn't do anything
|
|
|
|
* in the userland libzpool, as we don't want consumers to misinterpret ztest
|
|
|
|
* or zdb as real changes.
|
|
|
|
*/
|
|
|
|
void
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2017-05-30 21:39:17 +03:00
|
|
|
spa_event_post(spa_event_create(spa, vd, hist_nvl, name));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-08-26 22:49:16 +04:00
|
|
|
|
|
|
|
/* state manipulation functions */
|
|
|
|
EXPORT_SYMBOL(spa_open);
|
|
|
|
EXPORT_SYMBOL(spa_open_rewind);
|
|
|
|
EXPORT_SYMBOL(spa_get_stats);
|
|
|
|
EXPORT_SYMBOL(spa_create);
|
|
|
|
EXPORT_SYMBOL(spa_import);
|
|
|
|
EXPORT_SYMBOL(spa_tryimport);
|
|
|
|
EXPORT_SYMBOL(spa_destroy);
|
|
|
|
EXPORT_SYMBOL(spa_export);
|
|
|
|
EXPORT_SYMBOL(spa_reset);
|
|
|
|
EXPORT_SYMBOL(spa_async_request);
|
|
|
|
EXPORT_SYMBOL(spa_async_suspend);
|
|
|
|
EXPORT_SYMBOL(spa_async_resume);
|
|
|
|
EXPORT_SYMBOL(spa_inject_addref);
|
|
|
|
EXPORT_SYMBOL(spa_inject_delref);
|
|
|
|
EXPORT_SYMBOL(spa_scan_stat_init);
|
|
|
|
EXPORT_SYMBOL(spa_scan_get_stats);
|
|
|
|
|
2019-09-03 03:56:41 +03:00
|
|
|
/* device manipulation */
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(spa_vdev_add);
|
|
|
|
EXPORT_SYMBOL(spa_vdev_attach);
|
|
|
|
EXPORT_SYMBOL(spa_vdev_detach);
|
|
|
|
EXPORT_SYMBOL(spa_vdev_setpath);
|
|
|
|
EXPORT_SYMBOL(spa_vdev_setfru);
|
|
|
|
EXPORT_SYMBOL(spa_vdev_split_mirror);
|
|
|
|
|
|
|
|
/* spare statech is global across all pools) */
|
|
|
|
EXPORT_SYMBOL(spa_spare_add);
|
|
|
|
EXPORT_SYMBOL(spa_spare_remove);
|
|
|
|
EXPORT_SYMBOL(spa_spare_exists);
|
|
|
|
EXPORT_SYMBOL(spa_spare_activate);
|
|
|
|
|
|
|
|
/* L2ARC statech is global across all pools) */
|
|
|
|
EXPORT_SYMBOL(spa_l2cache_add);
|
|
|
|
EXPORT_SYMBOL(spa_l2cache_remove);
|
|
|
|
EXPORT_SYMBOL(spa_l2cache_exists);
|
|
|
|
EXPORT_SYMBOL(spa_l2cache_activate);
|
|
|
|
EXPORT_SYMBOL(spa_l2cache_drop);
|
|
|
|
|
|
|
|
/* scanning */
|
|
|
|
EXPORT_SYMBOL(spa_scan);
|
|
|
|
EXPORT_SYMBOL(spa_scan_stop);
|
|
|
|
|
|
|
|
/* spa syncing */
|
|
|
|
EXPORT_SYMBOL(spa_sync); /* only for DMU use */
|
|
|
|
EXPORT_SYMBOL(spa_sync_allpools);
|
|
|
|
|
|
|
|
/* properties */
|
|
|
|
EXPORT_SYMBOL(spa_prop_set);
|
|
|
|
EXPORT_SYMBOL(spa_prop_get);
|
|
|
|
EXPORT_SYMBOL(spa_prop_clear_bootfs);
|
|
|
|
|
|
|
|
/* asynchronous event notification */
|
|
|
|
EXPORT_SYMBOL(spa_event_notify);
|
2014-07-15 22:58:41 +04:00
|
|
|
|
2019-08-13 17:11:57 +03:00
|
|
|
/* BEGIN CSTYLED */
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_spa, spa_, load_verify_shift, UINT, ZMOD_RW,
|
2021-04-12 02:35:07 +03:00
|
|
|
"log2 fraction of arc that can be used by inflight I/Os when "
|
2019-09-06 00:49:49 +03:00
|
|
|
"verifying pool during import");
|
2022-01-21 19:07:15 +03:00
|
|
|
/* END CSTYLED */
|
2014-07-15 22:58:41 +04:00
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_spa, spa_, load_verify_metadata, INT, ZMOD_RW,
|
2014-07-15 22:58:41 +04:00
|
|
|
"Set to traverse metadata on pool import");
|
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_spa, spa_, load_verify_data, INT, ZMOD_RW,
|
2014-07-15 22:58:41 +04:00
|
|
|
"Set to traverse data on pool import");
|
2015-12-16 22:22:32 +03:00
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_spa, spa_, load_print_vdev_tree, INT, ZMOD_RW,
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
"Print vdev tree to zfs_dbgmsg during pool import");
|
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_zio, zio_, taskq_batch_pct, UINT, ZMOD_RD,
|
2015-12-16 22:22:32 +03:00
|
|
|
"Percentage of CPUs to run an IO worker thread");
|
|
|
|
|
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 19:13:53 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_zio, zio_, taskq_batch_tpq, UINT, ZMOD_RD,
|
|
|
|
"Number of threads per IO worker taskqueue");
|
|
|
|
|
2022-01-21 19:07:15 +03:00
|
|
|
/* BEGIN CSTYLED */
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs, zfs_, max_missing_tvds, ULONG, ZMOD_RW,
|
|
|
|
"Allow importing pool with up to this number of missing top-level "
|
|
|
|
"vdevs (in read-only mode)");
|
2022-01-21 19:07:15 +03:00
|
|
|
/* END CSTYLED */
|
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2016-07-22 17:39:36 +03:00
|
|
|
|
2022-01-21 19:07:15 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_livelist_condense, zfs_livelist_condense_, zthr_pause, INT,
|
|
|
|
ZMOD_RW, "Set the livelist condense zthr to pause");
|
2019-09-06 00:49:49 +03:00
|
|
|
|
2022-01-21 19:07:15 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_livelist_condense, zfs_livelist_condense_, sync_pause, INT,
|
|
|
|
ZMOD_RW, "Set the livelist condense synctask to pause");
|
2019-07-26 20:54:14 +03:00
|
|
|
|
2022-01-21 19:07:15 +03:00
|
|
|
/* BEGIN CSTYLED */
|
|
|
|
ZFS_MODULE_PARAM(zfs_livelist_condense, zfs_livelist_condense_, sync_cancel,
|
|
|
|
INT, ZMOD_RW,
|
2019-07-26 20:54:14 +03:00
|
|
|
"Whether livelist condensing was canceled in the synctask");
|
2019-09-06 00:49:49 +03:00
|
|
|
|
2022-01-21 19:07:15 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_livelist_condense, zfs_livelist_condense_, zthr_cancel,
|
|
|
|
INT, ZMOD_RW,
|
2019-07-26 20:54:14 +03:00
|
|
|
"Whether livelist condensing was canceled in the zthr function");
|
|
|
|
|
2022-01-21 19:07:15 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_livelist_condense, zfs_livelist_condense_, new_alloc, INT,
|
|
|
|
ZMOD_RW,
|
2019-09-06 00:49:49 +03:00
|
|
|
"Whether extra ALLOC blkptrs were added to a livelist entry while it "
|
|
|
|
"was being condensed");
|
2019-07-26 20:54:14 +03:00
|
|
|
/* END CSTYLED */
|