2020-04-14 21:36:28 +03:00
|
|
|
/*
|
|
|
|
* CDDL HEADER START
|
|
|
|
*
|
|
|
|
* The contents of this file are subject to the terms of the
|
|
|
|
* Common Development and Distribution License (the "License").
|
|
|
|
* You may not use this file except in compliance with the License.
|
|
|
|
*
|
|
|
|
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
|
2022-07-12 00:16:13 +03:00
|
|
|
* or https://opensource.org/licenses/CDDL-1.0.
|
2020-04-14 21:36:28 +03:00
|
|
|
* See the License for the specific language governing permissions
|
|
|
|
* and limitations under the License.
|
|
|
|
*
|
|
|
|
* When distributing Covered Code, include this CDDL HEADER in each
|
|
|
|
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
|
|
|
|
* If applicable, add the following below this CDDL HEADER, with the
|
|
|
|
* fields enclosed by brackets "[]" replaced with your own identifying
|
|
|
|
* information: Portions Copyright [yyyy] [name of copyright owner]
|
|
|
|
*
|
|
|
|
* CDDL HEADER END
|
|
|
|
*/
|
|
|
|
/*
|
|
|
|
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
|
2020-09-18 22:13:47 +03:00
|
|
|
* Copyright (c) 2011, 2020 by Delphix. All rights reserved.
|
2020-04-14 21:36:28 +03:00
|
|
|
*/
|
|
|
|
|
|
|
|
#include <sys/zfs_context.h>
|
|
|
|
#include <sys/spa.h>
|
2020-07-01 19:10:08 +03:00
|
|
|
#include <sys/file.h>
|
2020-04-14 21:36:28 +03:00
|
|
|
#include <sys/vdev_file.h>
|
|
|
|
#include <sys/vdev_impl.h>
|
|
|
|
#include <sys/zio.h>
|
|
|
|
#include <sys/fs/zfs.h>
|
|
|
|
#include <sys/fm/fs/zfs.h>
|
|
|
|
#include <sys/abd.h>
|
|
|
|
#include <sys/stat.h>
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Virtual device vector for files.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static taskq_t *vdev_file_taskq;
|
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit Windows, which created the
undesirable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
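The truncation described above can be illustrated with a small sketch. It is not code from this commit; `ulong_bits`, `fits_in_ulong` and `store_as_ulong` are hypothetical helper names used only for illustration, and the result depends on the data model of the compiler that builds it (LP64 gives a 64-bit `unsigned long`, LLP64 gives a 32-bit one).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Width in bits of `unsigned long` for the compiler building this sketch. */
static unsigned
ulong_bits(void)
{
	return ((unsigned)(sizeof (unsigned long) * CHAR_BIT));
}

/* Returns 1 if `value` survives a round trip through `unsigned long`. */
static int
fits_in_ulong(uint64_t value)
{
	return ((uint64_t)(unsigned long)value == value);
}

/*
 * Hypothetical setter: refuses values that an `unsigned long` parameter
 * would silently truncate. Returns 0 on success and -1 otherwise.
 */
static int
store_as_ulong(uint64_t value, unsigned long *out)
{
	assert(out != NULL);
	if (!fits_in_ulong(value))
		return (-1);
	*out = (unsigned long)value;
	return (0);
}
```

With a fixed-width `uint64_t` parameter the check is unnecessary, because the storage has the same width on every platform.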
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on
big-endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
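A minimal sketch of why the byte order matters, assuming the 32-bit store lands at the lowest address of the 64-bit variable. The helper names are hypothetical and the 4-byte store is simulated with `memcpy`; this is not the kernel's module parameter code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when the host stores the least significant byte first. */
static int
host_is_little_endian(void)
{
	const uint16_t probe = 1;
	uint8_t first;

	memcpy(&first, &probe, sizeof (first));
	return (first == 1);
}

/*
 * Simulate a parameter declared as uint64_t but written as a 32-bit
 * `ulong`: only the four bytes at the lowest address are updated.
 */
static uint64_t
store_low_address_word(uint32_t value)
{
	uint64_t param = 0;

	assert(sizeof (param) == 2 * sizeof (value));
	memcpy(&param, &value, sizeof (value));
	return (param);
}
```

On a little-endian host the variable ends up holding the intended value. On a big-endian host the same store fills the most significant half, so writing 12 would leave `12 << 32` in the variable.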
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
static uint_t vdev_file_logical_ashift = SPA_MINBLOCKSHIFT;
|
|
|
|
static uint_t vdev_file_physical_ashift = SPA_MINBLOCKSHIFT;
|
2020-09-18 22:13:47 +03:00
|
|
|
|
2020-04-14 21:36:28 +03:00
|
|
|
void
|
|
|
|
vdev_file_init(void)
|
|
|
|
{
|
|
|
|
vdev_file_taskq = taskq_create("z_vdev_file", MAX(max_ncpus, 16),
minclsyspri, max_ncpus, INT_MAX, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
vdev_file_fini(void)
|
|
|
|
{
|
|
|
|
taskq_destroy(vdev_file_taskq);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
vdev_file_hold(vdev_t *vd)
|
|
|
|
{
|
2021-05-01 02:36:10 +03:00
|
|
|
ASSERT3P(vd->vdev_path, !=, NULL);
|
2020-04-14 21:36:28 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
vdev_file_rele(vdev_t *vd)
|
|
|
|
{
|
2021-05-01 02:36:10 +03:00
|
|
|
ASSERT3P(vd->vdev_path, !=, NULL);
|
2020-04-14 21:36:28 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static mode_t
|
|
|
|
vdev_file_open_mode(spa_mode_t spa_mode)
|
|
|
|
{
|
|
|
|
mode_t mode = 0;
|
|
|
|
|
|
|
|
if ((spa_mode & SPA_MODE_READ) && (spa_mode & SPA_MODE_WRITE)) {
|
|
|
|
mode = O_RDWR;
|
|
|
|
} else if (spa_mode & SPA_MODE_READ) {
|
|
|
|
mode = O_RDONLY;
|
|
|
|
} else if (spa_mode & SPA_MODE_WRITE) {
|
|
|
|
mode = O_WRONLY;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (mode | O_LARGEFILE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
vdev_file_open(vdev_t *vd, uint64_t *psize, uint64_t *max_psize,
|
2020-08-21 22:53:17 +03:00
|
|
|
uint64_t *logical_ashift, uint64_t *physical_ashift)
|
2020-04-14 21:36:28 +03:00
|
|
|
{
|
|
|
|
vdev_file_t *vf;
|
|
|
|
zfs_file_t *fp;
|
|
|
|
zfs_file_attr_t zfa;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Rotational optimizations only make sense on block devices.
|
|
|
|
*/
|
|
|
|
vd->vdev_nonrot = B_TRUE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allow TRIM on file based vdevs. This may not always be supported,
|
|
|
|
* since it depends on your kernel version and underlying filesystem
|
|
|
|
* type but it is always safe to attempt.
|
|
|
|
*/
|
|
|
|
vd->vdev_has_trim = B_TRUE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Disable secure TRIM on file based vdevs. There is no way to
|
|
|
|
* request this behavior from the underlying filesystem.
|
|
|
|
*/
|
|
|
|
vd->vdev_has_securetrim = B_FALSE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We must have a pathname, and it must be absolute.
|
|
|
|
*/
|
|
|
|
if (vd->vdev_path == NULL || vd->vdev_path[0] != '/') {
|
|
|
|
vd->vdev_stat.vs_aux = VDEV_AUX_BAD_LABEL;
|
|
|
|
return (SET_ERROR(EINVAL));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reopen the device if it's not currently open. Otherwise,
|
|
|
|
* just update the physical size of the device.
|
|
|
|
*/
|
|
|
|
if (vd->vdev_tsd != NULL) {
|
|
|
|
ASSERT(vd->vdev_reopening);
|
|
|
|
vf = vd->vdev_tsd;
|
|
|
|
goto skip_open;
|
|
|
|
}
|
|
|
|
|
|
|
|
vf = vd->vdev_tsd = kmem_zalloc(sizeof (vdev_file_t), KM_SLEEP);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We always open the files from the root of the global zone, even if
|
|
|
|
* we're in a local zone. If the user has gotten to this point, the
|
|
|
|
* administrator has already decided that the pool should be available
|
|
|
|
* to local zone users, so the underlying devices should be as well.
|
|
|
|
*/
|
2021-05-01 02:36:10 +03:00
|
|
|
ASSERT3P(vd->vdev_path, !=, NULL);
|
|
|
|
ASSERT(vd->vdev_path[0] == '/');
|
2020-04-14 21:36:28 +03:00
|
|
|
|
|
|
|
error = zfs_file_open(vd->vdev_path,
vdev_file_open_mode(spa_mode(vd->vdev_spa)), 0, &fp);
|
|
|
|
if (error) {
|
|
|
|
vd->vdev_stat.vs_aux = VDEV_AUX_OPEN_FAILED;
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
vf->vf_file = fp;
|
|
|
|
|
|
|
|
#ifdef _KERNEL
|
|
|
|
/*
|
|
|
|
* Make sure it's a regular file.
|
|
|
|
*/
|
|
|
|
if (zfs_file_getattr(fp, &zfa)) {
|
|
|
|
return (SET_ERROR(ENODEV));
|
|
|
|
}
|
|
|
|
if (!S_ISREG(zfa.zfa_mode)) {
|
|
|
|
vd->vdev_stat.vs_aux = VDEV_AUX_OPEN_FAILED;
|
|
|
|
return (SET_ERROR(ENODEV));
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
skip_open:
|
|
|
|
|
|
|
|
error = zfs_file_getattr(vf->vf_file, &zfa);
|
|
|
|
if (error) {
|
|
|
|
vd->vdev_stat.vs_aux = VDEV_AUX_OPEN_FAILED;
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
*max_psize = *psize = zfa.zfa_size;
|
2020-09-18 22:13:47 +03:00
|
|
|
*logical_ashift = vdev_file_logical_ashift;
|
|
|
|
*physical_ashift = vdev_file_physical_ashift;
|
2020-04-14 21:36:28 +03:00
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
vdev_file_close(vdev_t *vd)
|
|
|
|
{
|
|
|
|
vdev_file_t *vf = vd->vdev_tsd;
|
|
|
|
|
|
|
|
if (vd->vdev_reopening || vf == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (vf->vf_file != NULL) {
|
|
|
|
zfs_file_close(vf->vf_file);
|
|
|
|
}
|
|
|
|
|
|
|
|
vd->vdev_delayed_close = B_FALSE;
|
|
|
|
kmem_free(vf, sizeof (vdev_file_t));
|
|
|
|
vd->vdev_tsd = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Implements the interrupt side for file vdev types. This routine will be
|
|
|
|
* called when the I/O completes allowing us to transfer the I/O to the
|
|
|
|
* interrupt taskqs. For consistency, the code structure mimics disk vdev
|
|
|
|
* types.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vdev_file_io_intr(zio_t *zio)
|
|
|
|
{
|
|
|
|
zio_delay_interrupt(zio);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
vdev_file_io_strategy(void *arg)
|
|
|
|
{
|
|
|
|
zio_t *zio = arg;
|
|
|
|
vdev_t *vd = zio->io_vd;
|
|
|
|
vdev_file_t *vf;
|
|
|
|
void *buf;
|
|
|
|
ssize_t resid;
|
|
|
|
loff_t off;
|
|
|
|
ssize_t size;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
off = zio->io_offset;
|
|
|
|
size = zio->io_size;
|
|
|
|
resid = 0;
|
|
|
|
|
|
|
|
vf = vd->vdev_tsd;
|
|
|
|
|
|
|
|
ASSERT(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE);
|
|
|
|
if (zio->io_type == ZIO_TYPE_READ) {
|
|
|
|
buf = abd_borrow_buf(zio->io_abd, zio->io_size);
|
|
|
|
err = zfs_file_pread(vf->vf_file, buf, size, off, &resid);
|
|
|
|
abd_return_buf_copy(zio->io_abd, buf, size);
|
|
|
|
} else {
|
|
|
|
buf = abd_borrow_buf_copy(zio->io_abd, zio->io_size);
|
|
|
|
err = zfs_file_pwrite(vf->vf_file, buf, size, off, &resid);
|
|
|
|
abd_return_buf(zio->io_abd, buf, size);
|
|
|
|
}
|
2021-12-23 22:39:29 +03:00
|
|
|
zio->io_error = err;
|
2020-04-14 21:36:28 +03:00
|
|
|
if (resid != 0 && zio->io_error == 0)
|
|
|
|
zio->io_error = ENOSPC;
|
|
|
|
|
|
|
|
vdev_file_io_intr(zio);
|
|
|
|
}
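The strategy routine above reports `ENOSPC` when the transfer completes without error but leaves a nonzero residual. A userland sketch of the same convention follows, using POSIX `pread` directly; `read_fully` is a hypothetical helper and not part of the ZFS file API, whose internals are not shown here.

```c
#ifndef _POSIX_C_SOURCE
#define	_POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read exactly `size` bytes at offset `off`. Returns 0 on success, the
 * errno value on a failed read, or ENOSPC when the file ends before the
 * request is satisfied (a nonzero residual with no other error).
 */
static int
read_fully(int fd, void *buf, size_t size, off_t off)
{
	char *p = buf;
	size_t resid = size;

	assert(buf != NULL || size == 0);
	while (resid > 0) {
		ssize_t n = pread(fd, p, resid, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return (errno);
		}
		if (n == 0)
			break;		/* end of file: short read */
		p += n;
		off += n;
		resid -= (size_t)n;
	}
	return (resid != 0 ? ENOSPC : 0);
}
```

Reading 16 bytes from an 8-byte file with this helper would return `ENOSPC`, mirroring how a short read or write on a file vdev is surfaced as an I/O error.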
|
|
|
|
|
|
|
|
static void
|
|
|
|
vdev_file_io_start(zio_t *zio)
|
|
|
|
{
|
|
|
|
vdev_t *vd = zio->io_vd;
|
|
|
|
vdev_file_t *vf = vd->vdev_tsd;
|
|
|
|
|
|
|
|
if (zio->io_type == ZIO_TYPE_IOCTL) {
|
|
|
|
/* XXPOLICY */
|
|
|
|
if (!vdev_readable(vd)) {
|
|
|
|
zio->io_error = SET_ERROR(ENXIO);
|
|
|
|
zio_interrupt(zio);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (zio->io_cmd) {
|
|
|
|
case DKIOCFLUSHWRITECACHE:
|
|
|
|
zio->io_error = zfs_file_fsync(vf->vf_file,
O_SYNC|O_DSYNC);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
zio->io_error = SET_ERROR(ENOTSUP);
|
|
|
|
}
|
|
|
|
|
|
|
|
zio_execute(zio);
|
|
|
|
return;
|
|
|
|
} else if (zio->io_type == ZIO_TYPE_TRIM) {
|
|
|
|
#ifdef notyet
|
|
|
|
int mode = 0;
|
|
|
|
|
|
|
|
ASSERT3U(zio->io_size, !=, 0);
|
|
|
|
|
|
|
|
/* XXX FreeBSD has no fallocate routine in file ops */
|
|
|
|
zio->io_error = zfs_file_fallocate(vf->vf_file,
mode, zio->io_offset, zio->io_size);
|
|
|
|
#endif
|
|
|
|
zio->io_error = SET_ERROR(ENOTSUP);
|
|
|
|
zio_execute(zio);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
ASSERT(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE);
|
|
|
|
zio->io_target_timestamp = zio_handle_io_delay(zio);
|
|
|
|
|
|
|
|
VERIFY3U(taskq_dispatch(vdev_file_taskq, vdev_file_io_strategy, zio,
TQ_SLEEP), !=, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
vdev_file_io_done(zio_t *zio)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) zio;
|
2020-04-14 21:36:28 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
vdev_ops_t vdev_file_ops = {
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to a pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
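To make the colon-separated syntax concrete, here is a rough sketch of how such a string could be split into its fields. It is not the parser used by `zpool`; `draid_spec_t` and `draid_spec_parse` are hypothetical names, the defaults are the ones listed above, and range checks (for example, limiting parity to 1, 2 or 3) are omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the values named in the option list. */
typedef struct draid_spec {
	unsigned long parity;	/* default 1 */
	unsigned long data;	/* default 8 */
	unsigned long children;	/* 0 means "not specified" */
	unsigned long spares;	/* default 0 */
} draid_spec_t;

/*
 * Parse "draid[<parity>][:<data>d][:<children>c][:<spares>s]".
 * Returns 0 on success and -1 on malformed input.
 */
static int
draid_spec_parse(const char *str, draid_spec_t *out)
{
	const char *p;
	char *end;

	assert(str != NULL && out != NULL);
	out->parity = 1;
	out->data = 8;
	out->children = 0;
	out->spares = 0;

	if (strncmp(str, "draid", 5) != 0)
		return (-1);
	p = str + 5;

	/* Optional parity digits directly after the type name. */
	if (*p >= '0' && *p <= '9') {
		out->parity = strtoul(p, &end, 10);
		p = end;
	}

	/* Zero or more ":<number><suffix>" fields, in any order. */
	while (*p == ':') {
		unsigned long v;

		p++;
		if (*p < '0' || *p > '9')
			return (-1);
		v = strtoul(p, &end, 10);
		switch (*end) {
		case 'd':
			out->data = v;
			break;
		case 'c':
			out->children = v;
			break;
		case 's':
			out->spares = v;
			break;
		default:
			return (-1);
		}
		p = end + 1;
	}
	return (*p == '\0' ? 0 : -1);
}
```

Under these assumptions, `draid2:8d:68c:2s` yields parity 2, 8 data devices per group, 68 children and 2 distributed spares, while a bare `draid` keeps every default.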
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
  pool: tank
 state: ONLINE
config:

    NAME                  STATE     READ WRITE CKSUM
    slag7                 ONLINE       0     0     0
      draid2:8d:68c:2s-0  ONLINE       0     0     0
        L0                ONLINE       0     0     0
        L1                ONLINE       0     0     0
        ...
        U25               ONLINE       0     0     0
        U26               ONLINE       0     0     0
        spare-53          ONLINE       0     0     0
          U27             ONLINE       0     0     0
          draid2-0-0      ONLINE       0     0     0
        U28               ONLINE       0     0     0
        U29               ONLINE       0     0     0
        ...
        U42               ONLINE       0     0     0
        U43               ONLINE       0     0     0
    special
      mirror-1            ONLINE       0     0     0
        L5                ONLINE       0     0     0
        U5                ONLINE       0     0     0
      mirror-2            ONLINE       0     0     0
        L6                ONLINE       0     0     0
        U6                ONLINE       0     0     0
    spares
      draid2-0-0          INUSE     currently in use
      draid2-0-1          AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leveraged
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated to provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
.vdev_op_init = NULL,
|
|
|
|
.vdev_op_fini = NULL,
|
|
|
|
.vdev_op_open = vdev_file_open,
|
|
|
|
.vdev_op_close = vdev_file_close,
|
|
|
|
.vdev_op_asize = vdev_default_asize,
|
|
|
|
.vdev_op_min_asize = vdev_default_min_asize,
|
|
|
|
.vdev_op_min_alloc = NULL,
|
|
|
|
.vdev_op_io_start = vdev_file_io_start,
|
|
|
|
.vdev_op_io_done = vdev_file_io_done,
|
|
|
|
.vdev_op_state_change = NULL,
|
|
|
|
.vdev_op_need_resilver = NULL,
|
|
|
|
.vdev_op_hold = vdev_file_hold,
|
|
|
|
.vdev_op_rele = vdev_file_rele,
|
|
|
|
.vdev_op_remap = NULL,
|
|
|
|
.vdev_op_xlate = vdev_default_xlate,
|
|
|
|
.vdev_op_rebuild_asize = NULL,
|
|
|
|
.vdev_op_metaslab_init = NULL,
|
|
|
|
.vdev_op_config_generate = NULL,
|
|
|
|
.vdev_op_nparity = NULL,
|
|
|
|
.vdev_op_ndisks = NULL,
|
|
|
|
.vdev_op_type = VDEV_TYPE_FILE, /* name of this vdev type */
|
|
|
|
.vdev_op_leaf = B_TRUE /* leaf vdev */
|
2020-04-14 21:36:28 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* From userland we access disks just like files.
|
|
|
|
*/
|
|
|
|
#ifndef _KERNEL
|
|
|
|
|
|
|
|
vdev_ops_t vdev_disk_ops = {
|
|
|
|
.vdev_op_init = NULL,
|
|
|
|
.vdev_op_fini = NULL,
|
|
|
|
.vdev_op_open = vdev_file_open,
|
|
|
|
.vdev_op_close = vdev_file_close,
|
|
|
|
.vdev_op_asize = vdev_default_asize,
|
|
|
|
.vdev_op_min_asize = vdev_default_min_asize,
|
|
|
|
.vdev_op_min_alloc = NULL,
|
|
|
|
.vdev_op_io_start = vdev_file_io_start,
|
|
|
|
.vdev_op_io_done = vdev_file_io_done,
|
|
|
|
.vdev_op_state_change = NULL,
|
|
|
|
.vdev_op_need_resilver = NULL,
|
|
|
|
.vdev_op_hold = vdev_file_hold,
|
|
|
|
.vdev_op_rele = vdev_file_rele,
|
|
|
|
.vdev_op_remap = NULL,
|
|
|
|
.vdev_op_xlate = vdev_default_xlate,
|
|
|
|
.vdev_op_rebuild_asize = NULL,
|
|
|
|
.vdev_op_metaslab_init = NULL,
|
|
|
|
.vdev_op_config_generate = NULL,
|
|
|
|
.vdev_op_nparity = NULL,
|
|
|
|
.vdev_op_ndisks = NULL,
|
|
|
|
.vdev_op_type = VDEV_TYPE_DISK, /* name of this vdev type */
|
|
|
|
.vdev_op_leaf = B_TRUE /* leaf vdev */
|
2020-04-14 21:36:28 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
#endif
|
2020-09-18 22:13:47 +03:00
|
|
|
|
|
|
|
ZFS_MODULE_PARAM(zfs_vdev_file, vdev_file_, logical_ashift, UINT, ZMOD_RW,
|
2020-09-18 22:13:47 +03:00
|
|
|
"Logical ashift for file-based devices");
|
|
|
|
ZFS_MODULE_PARAM(zfs_vdev_file, vdev_file_, physical_ashift, UINT, ZMOD_RW,
|
2020-09-18 22:13:47 +03:00
|
|
|
"Physical ashift for file-based devices");
|