/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or https://opensource.org/licenses/CDDL-1.0.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2010 Sun Microsystems, Inc. All rights reserved.
 * Use is subject to license terms.
 */
/*
 * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
 */

#ifndef	_SYS_TXG_H
#define	_SYS_TXG_H

#include <sys/spa.h>
#include <sys/zfs_context.h>

#ifdef	__cplusplus
extern "C" {
#endif

#define	TXG_CONCURRENT_STATES	3	/* open, quiescing, syncing */
#define	TXG_SIZE		4	/* next power of 2 */
#define	TXG_MASK		(TXG_SIZE - 1)	/* mask for size */
#define	TXG_INITIAL		TXG_SIZE	/* initial txg */
#define	TXG_IDX			(txg & TXG_MASK)
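/*
 * Illustrative note (an editorial sketch, not part of the original header):
 * because TXG_SIZE is a power of two, (txg & TXG_MASK) is equivalent to
 * (txg % TXG_SIZE), so consecutive txg numbers cycle through slots 0..3.
 * For example, txg 4 (TXG_INITIAL) maps to slot 0, txg 5 to slot 1,
 * txg 7 to slot 3, and txg 8 wraps back to slot 0.
 *
 * TXG_IDX expands to an expression that names `txg` directly, so it
 * presumably is meant to be used where a variable of that name is in
 * scope. A hypothetical caller might look like:
 *
 *	uint64_t txg = 9;
 *	int slot = TXG_IDX;	// 9 & 3 == 1
 */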
#define	TXG_UNKNOWN		0

/* Number of txgs worth of frees we defer adding to in-core spacemaps */
#define	TXG_DEFER_SIZE		2

typedef struct tx_cpu tx_cpu_t;

typedef struct txg_handle {
	tx_cpu_t	*th_cpu;
	uint64_t	th_txg;
} txg_handle_t;

typedef struct txg_node {
	struct txg_node	*tn_next[TXG_SIZE];
	uint8_t		tn_member[TXG_SIZE];
} txg_node_t;

typedef struct txg_list {
	kmutex_t	tl_lock;
	size_t		tl_offset;
	spa_t		*tl_spa;
	txg_node_t	*tl_head[TXG_SIZE];
} txg_list_t;

struct dsl_pool;

extern void txg_init(struct dsl_pool *dp, uint64_t txg);
extern void txg_fini(struct dsl_pool *dp);
extern void txg_sync_start(struct dsl_pool *dp);
extern void txg_sync_stop(struct dsl_pool *dp);
extern uint64_t txg_hold_open(struct dsl_pool *dp, txg_handle_t *txghp);
extern void txg_rele_to_quiesce(txg_handle_t *txghp);
extern void txg_rele_to_sync(txg_handle_t *txghp);
extern void txg_register_callbacks(txg_handle_t *txghp, list_t *tx_callbacks);

extern void txg_delay(struct dsl_pool *dp, uint64_t txg, hrtime_t delta,
    hrtime_t resolution);
extern void txg_kick(struct dsl_pool *dp, uint64_t txg);

/*
 * Wait until the given transaction group has finished syncing.
 * Try to make this happen as soon as possible (e.g. kick off any
 * necessary syncs immediately). If txg == 0, wait for the currently open
 * txg to finish syncing.
 */
extern void txg_wait_synced(struct dsl_pool *dp, uint64_t txg);

/*
 * Wait as above. Returns true if the thread was signaled while waiting.
 */
extern boolean_t txg_wait_synced_sig(struct dsl_pool *dp, uint64_t txg);

/*
 * Wait until the given transaction group, or one after it, is
 * the open transaction group. Try to make this happen as soon
 * as possible (e.g. kick off any necessary syncs immediately) when
 * should_quiesce is set. If txg == 0, wait for the next open txg.
 */
extern void txg_wait_open(struct dsl_pool *dp, uint64_t txg,
    boolean_t should_quiesce);

/*
 * Returns TRUE if we are "backed up" waiting for the syncing
 * transaction to complete; otherwise returns FALSE.
 */
extern boolean_t txg_stalled(struct dsl_pool *dp);

/* Returns TRUE if someone is waiting for the next txg to sync. */
extern boolean_t txg_sync_waiting(struct dsl_pool *dp);

extern void txg_verify(spa_t *spa, uint64_t txg);

/*
 * Wait for pending commit callbacks of already-synced transactions to finish
 * processing.
 */
extern void txg_wait_callbacks(struct dsl_pool *dp);

/*
 * Per-txg object lists.
 */

#define	TXG_CLEAN(txg)	((txg) - 1)

extern void txg_list_create(txg_list_t *tl, spa_t *spa, size_t offset);
extern void txg_list_destroy(txg_list_t *tl);
extern boolean_t txg_list_empty(txg_list_t *tl, uint64_t txg);
extern boolean_t txg_all_lists_empty(txg_list_t *tl);
extern boolean_t txg_list_add(txg_list_t *tl, void *p, uint64_t txg);
extern boolean_t txg_list_add_tail(txg_list_t *tl, void *p, uint64_t txg);
extern void *txg_list_remove(txg_list_t *tl, uint64_t txg);
extern void *txg_list_remove_this(txg_list_t *tl, void *p, uint64_t txg);
extern boolean_t txg_list_member(txg_list_t *tl, void *p, uint64_t txg);
extern void *txg_list_head(txg_list_t *tl, uint64_t txg);
extern void *txg_list_next(txg_list_t *tl, void *p, uint64_t txg);

/* Global tuning */
extern uint_t zfs_txg_timeout;


#ifdef ZFS_DEBUG
#define	TXG_VERIFY(spa, txg)		txg_verify(spa, txg)
#else
#define	TXG_VERIFY(spa, txg)
#endif

#ifdef	__cplusplus
}
#endif

#endif	/* _SYS_TXG_H */