// SPDX-License-Identifier: CDDL-1.0
/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or https://opensource.org/licenses/CDDL-1.0.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright (C) 2008-2010 Lawrence Livermore National Security, LLC.
 * Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
 * Rewritten for Linux by Brian Behlendorf <behlendorf1@llnl.gov>.
 * LLNL-CODE-403049.
 *
 * ZFS volume emulation driver.
 *
 * Makes a DMU object look like a volume of arbitrary size, up to 2^64 bytes.
 * Volumes are accessed through the symbolic links named:
 *
 * /dev/<pool_name>/<dataset_name>
 *
 * Volumes are persistent through reboot and module load.  No user command
 * needs to be run before opening and using a device.
 *
 * Copyright 2014 Nexenta Systems, Inc.  All rights reserved.
 * Copyright (c) 2016 Actifio, Inc. All rights reserved.
 * Copyright (c) 2012, 2019 by Delphix. All rights reserved.
 * Copyright (c) 2024, 2025, Klara, Inc.
 */

/*
 * Note on locking of zvol state structures.
 *
 * zvol_state_t represents the connection between a single dataset
 * (DMU_OST_ZVOL) and the device "minor" (some OS-specific representation of a
 * "disk" or "device" or "volume", eg, a /dev/zdXX node, a GEOM object, etc).
 *
 * The global zvol_state_lock is used to protect access to zvol_state_list and
 * zvol_htable, which are the primary way to obtain a zvol_state_t from a name.
 * It should not be used for anything not name-related, and you should avoid
 * sleeping or waiting while it's held. See zvol_find_by_name(), zvol_insert(),
 * zvol_remove().
 *
 * The zv_state_lock is used to protect the contents of the associated
 * zvol_state_t. Most of the zvol_state_t is dedicated to control and
 * configuration; almost none of it is needed for data operations (that is,
 * read, write, flush) so this lock is rarely taken during general IO. It
 * should be released quickly; you should avoid sleeping or waiting while it's
 * held.
 *
 * zv_suspend_lock is used to suspend IO/data operations to a zvol. The read
 * half should be held for the duration of an IO operation. The write half
 * should be taken when something needs to wait for IO to complete and block
 * further IO, eg for the duration of receive and rollback operations. This
 * lock can be held for long periods of time.
 *
 * Thus, the following lock ordering applies.
 * - take zvol_state_lock if necessary, to protect zvol_state_list
 * - take zv_suspend_lock if necessary, by the code path in question
 * - take zv_state_lock to protect zvol_state_t
 *
 * The minor operations are issued to spa->spa_zvol_taskq queues, which are
 * single-threaded (to preserve order of minor operations), and are executed
 * through the zvol_task_cb that dispatches the specific operations. Therefore,
 * these operations are serialized per pool. Consequently, we can be certain
 * that for a given zvol, there is only one operation at a time in progress.
 * That is why one can be sure that, first, the zvol_state_t for a given zvol
 * is allocated and placed on zvol_state_list, and then other minor operations
 * for this zvol proceed in the order of issue.
 */
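
/*
 * For example (an illustrative sketch only, not a distinct code path), a
 * control operation that looks a zvol up by name and then updates it follows
 * the ordering above by going through zvol_find_by_name():
 *
 *	zvol_state_t *zv = zvol_find_by_name(name, RW_READER);
 *	// returns with zv_suspend_lock (read) and zv_state_lock held
 *	if (zv != NULL) {
 *		... update fields of *zv ...
 *		rw_exit(&zv->zv_suspend_lock);
 *		mutex_exit(&zv->zv_state_lock);
 *	}
 */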

#include <sys/dataset_kstats.h>
#include <sys/dbuf.h>
#include <sys/dmu_traverse.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_prop.h>
#include <sys/dsl_dir.h>
#include <sys/zap.h>
#include <sys/zfeature.h>
#include <sys/zil_impl.h>
#include <sys/dmu_tx.h>
#include <sys/zio.h>
#include <sys/zfs_rlock.h>
#include <sys/spa_impl.h>
#include <sys/zvol.h>
#include <sys/zvol_impl.h>

unsigned int zvol_inhibit_dev = 0;
unsigned int zvol_prefetch_bytes = (128 * 1024);
unsigned int zvol_volmode = ZFS_VOLMODE_GEOM;
unsigned int zvol_threads = 0;
unsigned int zvol_num_taskqs = 0;
unsigned int zvol_request_sync = 0;

struct hlist_head *zvol_htable;
static list_t zvol_state_list;
krwlock_t zvol_state_lock;
extern int zfs_bclone_wait_dirty;
zv_taskq_t zvol_taskqs;

typedef enum {
	ZVOL_ASYNC_CREATE_MINORS,
	ZVOL_ASYNC_REMOVE_MINORS,
	ZVOL_ASYNC_RENAME_MINORS,
	ZVOL_ASYNC_SET_SNAPDEV,
	ZVOL_ASYNC_SET_VOLMODE,
	ZVOL_ASYNC_MAX
} zvol_async_op_t;
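
/*
 * Each asynchronous minor operation is described by a zvol_task_t (below),
 * tagged with one of the op codes above and dispatched to the per-pool
 * spa->spa_zvol_taskq, where zvol_task_cb() runs it (see the locking note at
 * the top of this file).
 */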

typedef struct {
	zvol_async_op_t zt_op;
	char zt_name1[MAXNAMELEN];
	char zt_name2[MAXNAMELEN];
	uint64_t zt_value;
	uint32_t zt_total;
	uint32_t zt_done;
	int32_t zt_status;
	int zt_error;
} zvol_task_t;

zv_request_task_t *
zv_request_task_create(zv_request_t zvr)
{
	zv_request_task_t *task;
	task = kmem_alloc(sizeof (zv_request_task_t), KM_SLEEP);
	taskq_init_ent(&task->ent);
	task->zvr = zvr;
	return (task);
}

void
zv_request_task_free(zv_request_task_t *task)
{
	kmem_free(task, sizeof (*task));
}

uint64_t
zvol_name_hash(const char *name)
{
	uint64_t crc = -1ULL;
	ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
	for (const uint8_t *p = (const uint8_t *)name; *p != 0; p++)
		crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (*p)) & 0xFF];
	return (crc);
}
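
/*
 * The CRC64 hash above selects a zvol_htable bucket (ZVOL_HT_HEAD());
 * zvol_find_by_name_hash() below walks that bucket, comparing both the
 * stored hash and the full name to resolve collisions.
 */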

/*
 * Find a zvol_state_t given the name and hash generated by zvol_name_hash.
 * If found, return with zv_suspend_lock and zv_state_lock taken, otherwise,
 * return (NULL) without taking the locks. The zv_suspend_lock is always taken
 * before zv_state_lock. The mode argument indicates the mode (including none)
 * for zv_suspend_lock to be taken.
 */
zvol_state_t *
zvol_find_by_name_hash(const char *name, uint64_t hash, int mode)
{
	zvol_state_t *zv;
	struct hlist_node *p = NULL;

	rw_enter(&zvol_state_lock, RW_READER);
	hlist_for_each(p, ZVOL_HT_HEAD(hash)) {
		zv = hlist_entry(p, zvol_state_t, zv_hlink);
		mutex_enter(&zv->zv_state_lock);
		if (zv->zv_hash == hash && strcmp(zv->zv_name, name) == 0) {
			/*
			 * this is the right zvol, take the locks in the
			 * right order
			 */
			if (mode != RW_NONE &&
			    !rw_tryenter(&zv->zv_suspend_lock, mode)) {
				mutex_exit(&zv->zv_state_lock);
				rw_enter(&zv->zv_suspend_lock, mode);
				mutex_enter(&zv->zv_state_lock);
				/*
				 * zvol cannot be renamed as we continue
				 * to hold zvol_state_lock
				 */
				ASSERT(zv->zv_hash == hash &&
				    strcmp(zv->zv_name, name) == 0);
			}
			rw_exit(&zvol_state_lock);
			return (zv);
		}
		mutex_exit(&zv->zv_state_lock);
	}
	rw_exit(&zvol_state_lock);

	return (NULL);
}

/*
 * Find a zvol_state_t given the name.
 * If found, return with zv_suspend_lock and zv_state_lock taken, otherwise,
 * return (NULL) without taking the locks. The zv_suspend_lock is always taken
 * before zv_state_lock. The mode argument indicates the mode (including none)
 * for zv_suspend_lock to be taken.
 */
static zvol_state_t *
zvol_find_by_name(const char *name, int mode)
{
	return (zvol_find_by_name_hash(name, zvol_name_hash(name), mode));
}

/*
 * ZFS_IOC_CREATE callback handles dmu zvol and zap object creation.
 */
void
zvol_create_cb(objset_t *os, void *arg, cred_t *cr, dmu_tx_t *tx)
{
	zfs_creat_t *zct = arg;
	nvlist_t *nvprops = zct->zct_props;
	int error;
	uint64_t volblocksize, volsize;

	VERIFY0(nvlist_lookup_uint64(nvprops,
	    zfs_prop_to_name(ZFS_PROP_VOLSIZE), &volsize));
	if (nvlist_lookup_uint64(nvprops,
	    zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), &volblocksize) != 0)
		volblocksize = zfs_prop_default_numeric(ZFS_PROP_VOLBLOCKSIZE);

	/*
	 * These properties must be removed from the list so the generic
	 * property setting step won't apply to them.
	 */
	VERIFY0(nvlist_remove_all(nvprops, zfs_prop_to_name(ZFS_PROP_VOLSIZE)));
	(void) nvlist_remove_all(nvprops,
	    zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE));

	error = dmu_object_claim(os, ZVOL_OBJ, DMU_OT_ZVOL, volblocksize,
	    DMU_OT_NONE, 0, tx);
	ASSERT0(error);

	error = zap_create_claim(os, ZVOL_ZAP_OBJ, DMU_OT_ZVOL_PROP,
	    DMU_OT_NONE, 0, tx);
	ASSERT0(error);

	error = zap_update(os, ZVOL_ZAP_OBJ, "size", 8, 1, &volsize, tx);
	ASSERT0(error);
}
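
/*
 * After creation the dataset holds two objects: ZVOL_OBJ carries the volume
 * data, and ZVOL_ZAP_OBJ carries its persistent attributes, including the
 * "size" entry written above and read back by zvol_get_stats() below.
 */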

/*
 * ZFS_IOC_OBJSET_STATS entry point.
 */
int
zvol_get_stats(objset_t *os, nvlist_t *nv)
{
	int error;
	dmu_object_info_t *doi;
	uint64_t val;

	error = zap_lookup(os, ZVOL_ZAP_OBJ, "size", 8, 1, &val);
	if (error)
		return (error);

	dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_VOLSIZE, val);
	doi = kmem_alloc(sizeof (dmu_object_info_t), KM_SLEEP);
	error = dmu_object_info(os, ZVOL_OBJ, doi);

	if (error == 0) {
		dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_VOLBLOCKSIZE,
		    doi->doi_data_block_size);
	}

	kmem_free(doi, sizeof (dmu_object_info_t));

	return (error);
}

/*
 * Sanity check volume size.
 */
int
zvol_check_volsize(uint64_t volsize, uint64_t blocksize)
{
	if (volsize == 0)
		return (SET_ERROR(EINVAL));

	if (volsize % blocksize != 0)
		return (SET_ERROR(EINVAL));

#ifdef _ILP32
	if (volsize - 1 > SPEC_MAXOFFSET_T)
		return (SET_ERROR(EOVERFLOW));
#endif
	return (0);
}

/*
 * Ensure the zap is flushed then inform the VFS of the capacity change.
 */
static int
zvol_update_volsize(uint64_t volsize, objset_t *os)
{
	dmu_tx_t *tx;
	int error;
	uint64_t txg;

	tx = dmu_tx_create(os);
	dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL);
	dmu_tx_mark_netfree(tx);
	error = dmu_tx_assign(tx, DMU_TX_WAIT);
	if (error) {
		dmu_tx_abort(tx);
		return (error);
	}
	txg = dmu_tx_get_txg(tx);

	error = zap_update(os, ZVOL_ZAP_OBJ, "size", 8, 1,
	    &volsize, tx);
	dmu_tx_commit(tx);

	txg_wait_synced(dmu_objset_pool(os), txg);

	if (error == 0)
		error = dmu_free_long_range(os,
		    ZVOL_OBJ, volsize, DMU_OBJECT_END);

	return (error);
}

/*
 * ZFS_PROP_VOLSIZE set entry point.  Note that modifying the volume
 * size will result in a udev "change" event being generated.
 */
int
zvol_set_volsize(const char *name, uint64_t volsize)
{
	objset_t *os = NULL;
	uint64_t readonly;
	int error;
	boolean_t owned = B_FALSE;

	error = dsl_prop_get_integer(name,
	    zfs_prop_to_name(ZFS_PROP_READONLY), &readonly, NULL);
	if (error != 0)
		return (error);
	if (readonly)
		return (SET_ERROR(EROFS));

	zvol_state_t *zv = zvol_find_by_name(name, RW_READER);

	ASSERT(zv == NULL || (MUTEX_HELD(&zv->zv_state_lock) &&
	    RW_READ_HELD(&zv->zv_suspend_lock)));

	if (zv == NULL || zv->zv_objset == NULL) {
		if (zv != NULL)
			rw_exit(&zv->zv_suspend_lock);
		if ((error = dmu_objset_own(name, DMU_OST_ZVOL, B_FALSE, B_TRUE,
		    FTAG, &os)) != 0) {
			if (zv != NULL)
				mutex_exit(&zv->zv_state_lock);
			return (error);
		}
		owned = B_TRUE;
		if (zv != NULL)
			zv->zv_objset = os;
	} else {
		os = zv->zv_objset;
	}

	dmu_object_info_t *doi = kmem_alloc(sizeof (*doi), KM_SLEEP);

	if ((error = dmu_object_info(os, ZVOL_OBJ, doi)) ||
	    (error = zvol_check_volsize(volsize, doi->doi_data_block_size)))
		goto out;

	error = zvol_update_volsize(volsize, os);
	if (error == 0 && zv != NULL) {
		zv->zv_volsize = volsize;
		zv->zv_changed = 1;
	}
out:
	kmem_free(doi, sizeof (dmu_object_info_t));

	if (owned) {
		dmu_objset_disown(os, B_TRUE, FTAG);
		if (zv != NULL)
			zv->zv_objset = NULL;
	} else {
		rw_exit(&zv->zv_suspend_lock);
	}

	if (zv != NULL)
		mutex_exit(&zv->zv_state_lock);

	if (error == 0 && zv != NULL)
		zvol_os_update_volsize(zv, volsize);

	return (error);
}
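
/*
 * The in-core property updates below look their zvol up with
 * zvol_find_by_name(name, RW_NONE), so they return holding only
 * zv_state_lock; no IO needs to be suspended for these changes.
 */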

/*
 * Update volthreading.
 */
int
zvol_set_volthreading(const char *name, boolean_t value)
{
	zvol_state_t *zv = zvol_find_by_name(name, RW_NONE);
	if (zv == NULL)
		return (SET_ERROR(ENOENT));
	zv->zv_threading = value;
	mutex_exit(&zv->zv_state_lock);
	return (0);
}

/*
 * Update zvol ro property.
 */
int
zvol_set_ro(const char *name, boolean_t value)
{
	zvol_state_t *zv = zvol_find_by_name(name, RW_NONE);
	if (zv == NULL)
		return (-1);
	if (value) {
		zvol_os_set_disk_ro(zv, 1);
		zv->zv_flags |= ZVOL_RDONLY;
	} else {
		zvol_os_set_disk_ro(zv, 0);
		zv->zv_flags &= ~ZVOL_RDONLY;
	}
	mutex_exit(&zv->zv_state_lock);
	return (0);
}

/*
 * Sanity check volume block size.
 */
int
zvol_check_volblocksize(const char *name, uint64_t volblocksize)
{
	/* Record sizes above 128k need the feature to be enabled */
	if (volblocksize > SPA_OLD_MAXBLOCKSIZE) {
		spa_t *spa;
		int error;

		if ((error = spa_open(name, &spa, FTAG)) != 0)
			return (error);

		if (!spa_feature_is_enabled(spa, SPA_FEATURE_LARGE_BLOCKS)) {
			spa_close(spa, FTAG);
			return (SET_ERROR(ENOTSUP));
		}

		/*
		 * We don't allow setting the property above 1MB,
		 * unless the tunable has been changed.
		 */
		if (volblocksize > zfs_max_recordsize) {
			spa_close(spa, FTAG);
			return (SET_ERROR(EDOM));
		}

		spa_close(spa, FTAG);
	}

	if (volblocksize < SPA_MINBLOCKSIZE ||
	    volblocksize > SPA_MAXBLOCKSIZE ||
	    !ISP2(volblocksize))
		return (SET_ERROR(EDOM));

	return (0);
}
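
/*
 * ZIL replay.  The handlers below re-apply TX_TRUNCATE, TX_WRITE and
 * TX_CLONE_RANGE records found in a zvol's intent log after a crash; all
 * other record types are rejected.  They are wired up in
 * zvol_replay_vector[] further down.
 */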

/*
 * Replay a TX_TRUNCATE ZIL transaction if asked.  TX_TRUNCATE is how we
 * implement DKIOCFREE/free-long-range.
 */
static int
zvol_replay_truncate(void *arg1, void *arg2, boolean_t byteswap)
{
	zvol_state_t *zv = arg1;
	lr_truncate_t *lr = arg2;
	uint64_t offset, length;

	ASSERT3U(lr->lr_common.lrc_reclen, >=, sizeof (*lr));

	if (byteswap)
		byteswap_uint64_array(lr, sizeof (*lr));

	offset = lr->lr_offset;
	length = lr->lr_length;

	dmu_tx_t *tx = dmu_tx_create(zv->zv_objset);
	dmu_tx_mark_netfree(tx);
	int error = dmu_tx_assign(tx, DMU_TX_WAIT);
	if (error != 0) {
		dmu_tx_abort(tx);
	} else {
		(void) zil_replaying(zv->zv_zilog, tx);
		dmu_tx_commit(tx);
		error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, offset,
		    length);
	}

	return (error);
}

/*
 * Replay a TX_WRITE ZIL transaction that didn't get committed
 * after a system failure
 */
static int
zvol_replay_write(void *arg1, void *arg2, boolean_t byteswap)
{
	zvol_state_t *zv = arg1;
	lr_write_t *lr = arg2;
	objset_t *os = zv->zv_objset;
	char *data = (char *)(lr + 1);	/* data follows lr_write_t */
	uint64_t offset, length;
	dmu_tx_t *tx;
	int error;

	ASSERT3U(lr->lr_common.lrc_reclen, >=, sizeof (*lr));

	if (byteswap)
		byteswap_uint64_array(lr, sizeof (*lr));

	offset = lr->lr_offset;
	length = lr->lr_length;

	/* If it's a dmu_sync() block, write the whole block */
	if (lr->lr_common.lrc_reclen == sizeof (lr_write_t)) {
		uint64_t blocksize = BP_GET_LSIZE(&lr->lr_blkptr);
		if (length < blocksize) {
			offset -= offset % blocksize;
			length = blocksize;
		}
	}

	tx = dmu_tx_create(os);
	dmu_tx_hold_write(tx, ZVOL_OBJ, offset, length);
	error = dmu_tx_assign(tx, DMU_TX_WAIT);
	if (error) {
		dmu_tx_abort(tx);
	} else {
		dmu_write(os, ZVOL_OBJ, offset, length, data, tx);
		(void) zil_replaying(zv->zv_zilog, tx);
		dmu_tx_commit(tx);
	}

	return (error);
}

/*
 * Replay a TX_CLONE_RANGE ZIL transaction that didn't get committed
 * after a system failure
 */
static int
zvol_replay_clone_range(void *arg1, void *arg2, boolean_t byteswap)
{
	zvol_state_t *zv = arg1;
	lr_clone_range_t *lr = arg2;
	objset_t *os = zv->zv_objset;
	dmu_tx_t *tx;
	int error;
	uint64_t blksz;
	uint64_t off;
	uint64_t len;

	ASSERT3U(lr->lr_common.lrc_reclen, >=, sizeof (*lr));
	ASSERT3U(lr->lr_common.lrc_reclen, >=, offsetof(lr_clone_range_t,
	    lr_bps[lr->lr_nbps]));

	if (byteswap)
		byteswap_uint64_array(lr, sizeof (*lr));

	ASSERT(spa_feature_is_enabled(dmu_objset_spa(os),
	    SPA_FEATURE_BLOCK_CLONING));

	off = lr->lr_offset;
	len = lr->lr_length;
	blksz = lr->lr_blksz;

	if ((off % blksz) != 0) {
		return (SET_ERROR(EINVAL));
	}

	error = dnode_hold(os, ZVOL_OBJ, zv, &zv->zv_dn);
	if (error != 0 || !zv->zv_dn)
		return (error);
	tx = dmu_tx_create(os);
	dmu_tx_hold_clone_by_dnode(tx, zv->zv_dn, off, len, blksz);
	error = dmu_tx_assign(tx, DMU_TX_WAIT);
	if (error != 0) {
		dmu_tx_abort(tx);
		goto out;
	}
	error = dmu_brt_clone(zv->zv_objset, ZVOL_OBJ, off, len,
	    tx, lr->lr_bps, lr->lr_nbps);
	if (error != 0) {
		dmu_tx_commit(tx);
		goto out;
	}

	/*
	 * zil_replaying() not only checks if we are replaying ZIL, but also
	 * updates the ZIL header to record replay progress.
	 */
	VERIFY(zil_replaying(zv->zv_zilog, tx));
	dmu_tx_commit(tx);

out:
	dnode_rele(zv->zv_dn, zv);
	zv->zv_dn = NULL;
	return (error);
}

int
zvol_clone_range(zvol_state_t *zv_src, uint64_t inoff, zvol_state_t *zv_dst,
    uint64_t outoff, uint64_t len)
{
	zilog_t *zilog_dst;
	zfs_locked_range_t *inlr, *outlr;
	objset_t *inos, *outos;
	dmu_tx_t *tx;
	blkptr_t *bps;
	size_t maxblocks;
	int error = 0;

	rw_enter(&zv_dst->zv_suspend_lock, RW_READER);
	if (zv_dst->zv_zilog == NULL) {
		rw_exit(&zv_dst->zv_suspend_lock);
		rw_enter(&zv_dst->zv_suspend_lock, RW_WRITER);
		if (zv_dst->zv_zilog == NULL) {
			zv_dst->zv_zilog = zil_open(zv_dst->zv_objset,
			    zvol_get_data, &zv_dst->zv_kstat.dk_zil_sums);
			zv_dst->zv_flags |= ZVOL_WRITTEN_TO;
			VERIFY0((zv_dst->zv_zilog->zl_header->zh_flags &
			    ZIL_REPLAY_NEEDED));
		}
		rw_downgrade(&zv_dst->zv_suspend_lock);
	}
	if (zv_src != zv_dst)
		rw_enter(&zv_src->zv_suspend_lock, RW_READER);

	inos = zv_src->zv_objset;
	outos = zv_dst->zv_objset;

	/*
	 * Sanity checks
	 */
	if (!spa_feature_is_enabled(dmu_objset_spa(outos),
	    SPA_FEATURE_BLOCK_CLONING)) {
		error = SET_ERROR(EOPNOTSUPP);
		goto out;
	}
	if (dmu_objset_spa(inos) != dmu_objset_spa(outos)) {
		error = SET_ERROR(EXDEV);
		goto out;
	}
	if (inos->os_encrypted != outos->os_encrypted) {
		error = SET_ERROR(EXDEV);
		goto out;
	}
	if (zv_src->zv_volblocksize != zv_dst->zv_volblocksize) {
		error = SET_ERROR(EINVAL);
		goto out;
	}
	if (inoff >= zv_src->zv_volsize || outoff >= zv_dst->zv_volsize) {
		goto out;
	}

	/*
	 * Do not read beyond boundary
	 */
	if (len > zv_src->zv_volsize - inoff)
		len = zv_src->zv_volsize - inoff;
	if (len > zv_dst->zv_volsize - outoff)
		len = zv_dst->zv_volsize - outoff;
	if (len == 0)
		goto out;

	/*
	 * No overlapping if we are cloning within the same file
	 */
	if (zv_src == zv_dst) {
		if (inoff < outoff + len && outoff < inoff + len) {
			error = SET_ERROR(EINVAL);
			goto out;
		}
	}

	/*
	 * Offsets and length must be at block boundaries
	 */
	if ((inoff % zv_src->zv_volblocksize) != 0 ||
	    (outoff % zv_dst->zv_volblocksize) != 0) {
		error = SET_ERROR(EINVAL);
		goto out;
	}

	/*
	 * Length must be multiple of block size
	 */
	if ((len % zv_src->zv_volblocksize) != 0) {
		error = SET_ERROR(EINVAL);
		goto out;
	}

	zilog_dst = zv_dst->zv_zilog;
	maxblocks = zil_max_log_data(zilog_dst, sizeof (lr_clone_range_t)) /
	    sizeof (bps[0]);
	bps = vmem_alloc(sizeof (bps[0]) * maxblocks, KM_SLEEP);
	/*
	 * Maintain predictable lock order.
	 */
	if (zv_src < zv_dst || (zv_src == zv_dst && inoff < outoff)) {
		inlr = zfs_rangelock_enter(&zv_src->zv_rangelock, inoff, len,
		    RL_READER);
		outlr = zfs_rangelock_enter(&zv_dst->zv_rangelock, outoff, len,
		    RL_WRITER);
	} else {
		outlr = zfs_rangelock_enter(&zv_dst->zv_rangelock, outoff, len,
		    RL_WRITER);
		inlr = zfs_rangelock_enter(&zv_src->zv_rangelock, inoff, len,
		    RL_READER);
	}

	while (len > 0) {
		uint64_t size, last_synced_txg;
		size_t nbps = maxblocks;
		size = MIN(zv_src->zv_volblocksize * maxblocks, len);
		last_synced_txg = spa_last_synced_txg(
		    dmu_objset_spa(zv_src->zv_objset));
		error = dmu_read_l0_bps(zv_src->zv_objset, ZVOL_OBJ, inoff,
		    size, bps, &nbps);
		if (error != 0) {
			/*
			 * If we are trying to clone a block that was created
			 * in the current transaction group, the error will be
			 * EAGAIN here.  Based on zfs_bclone_wait_dirty either
			 * return a shortened range to the caller so it can
			 * fallback, or wait for the next TXG and check again.
			 */
			if (error == EAGAIN && zfs_bclone_wait_dirty) {
				txg_wait_synced(dmu_objset_pool
				    (zv_src->zv_objset), last_synced_txg + 1);
				continue;
			}
			break;
		}

		tx = dmu_tx_create(zv_dst->zv_objset);
		dmu_tx_hold_clone_by_dnode(tx, zv_dst->zv_dn, outoff, size,
		    zv_src->zv_volblocksize);
		error = dmu_tx_assign(tx, DMU_TX_WAIT);
		if (error != 0) {
			dmu_tx_abort(tx);
			break;
		}
		error = dmu_brt_clone(zv_dst->zv_objset, ZVOL_OBJ, outoff, size,
		    tx, bps, nbps);
		if (error != 0) {
			dmu_tx_commit(tx);
			break;
		}
		zvol_log_clone_range(zilog_dst, tx, TX_CLONE_RANGE, outoff,
		    size, zv_src->zv_volblocksize, bps, nbps);
		dmu_tx_commit(tx);
		inoff += size;
		outoff += size;
		len -= size;
	}
	vmem_free(bps, sizeof (bps[0]) * maxblocks);
	zfs_rangelock_exit(outlr);
	zfs_rangelock_exit(inlr);
	if (error == 0 && zv_dst->zv_objset->os_sync == ZFS_SYNC_ALWAYS) {
		error = zil_commit(zilog_dst, ZVOL_OBJ);
	}
out:
	if (zv_src != zv_dst)
		rw_exit(&zv_src->zv_suspend_lock);
	rw_exit(&zv_dst->zv_suspend_lock);
	return (error);
}
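
/*
 * Each chunk cloned above is recorded in the destination ZIL through
 * zvol_log_clone_range() below; after a crash those records are re-applied
 * by zvol_replay_clone_range() above.
 */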

/*
 * Handles TX_CLONE_RANGE transactions.
 */
void
zvol_log_clone_range(zilog_t *zilog, dmu_tx_t *tx, int txtype, uint64_t off,
    uint64_t len, uint64_t blksz, const blkptr_t *bps, size_t nbps)
{
	itx_t *itx;
	lr_clone_range_t *lr;
	uint64_t partlen, max_log_data;
	size_t partnbps;

	if (zil_replaying(zilog, tx))
		return;

	max_log_data = zil_max_log_data(zilog, sizeof (lr_clone_range_t));

	while (nbps > 0) {
		partnbps = MIN(nbps, max_log_data / sizeof (bps[0]));
		partlen = partnbps * blksz;
		ASSERT3U(partlen, <, len + blksz);
		partlen = MIN(partlen, len);

		itx = zil_itx_create(txtype,
		    sizeof (*lr) + sizeof (bps[0]) * partnbps);
		lr = (lr_clone_range_t *)&itx->itx_lr;
		lr->lr_foid = ZVOL_OBJ;
		lr->lr_offset = off;
		lr->lr_length = partlen;
		lr->lr_blksz = blksz;
		lr->lr_nbps = partnbps;
		memcpy(lr->lr_bps, bps, sizeof (bps[0]) * partnbps);

		zil_itx_assign(zilog, itx, tx);

		bps += partnbps;
		ASSERT3U(nbps, >=, partnbps);
		nbps -= partnbps;
		off += partlen;
		ASSERT3U(len, >=, partlen);
		len -= partlen;
	}
}

static int
zvol_replay_err(void *arg1, void *arg2, boolean_t byteswap)
{
	(void) arg1, (void) arg2, (void) byteswap;
	return (SET_ERROR(ENOTSUP));
}

/*
 * Callback vectors for replaying records.
 * Only TX_WRITE, TX_TRUNCATE and TX_CLONE_RANGE are needed for zvol.
 */
zil_replay_func_t *const zvol_replay_vector[TX_MAX_TYPE] = {
	zvol_replay_err,	/* no such transaction type */
	zvol_replay_err,	/* TX_CREATE */
	zvol_replay_err,	/* TX_MKDIR */
	zvol_replay_err,	/* TX_MKXATTR */
	zvol_replay_err,	/* TX_SYMLINK */
	zvol_replay_err,	/* TX_REMOVE */
	zvol_replay_err,	/* TX_RMDIR */
	zvol_replay_err,	/* TX_LINK */
	zvol_replay_err,	/* TX_RENAME */
	zvol_replay_write,	/* TX_WRITE */
	zvol_replay_truncate,	/* TX_TRUNCATE */
	zvol_replay_err,	/* TX_SETATTR */
	zvol_replay_err,	/* TX_ACL_V0 */
	zvol_replay_err,	/* TX_ACL */
	zvol_replay_err,	/* TX_CREATE_ACL */
	zvol_replay_err,	/* TX_CREATE_ATTR */
	zvol_replay_err,	/* TX_CREATE_ACL_ATTR */
	zvol_replay_err,	/* TX_MKDIR_ACL */
	zvol_replay_err,	/* TX_MKDIR_ATTR */
	zvol_replay_err,	/* TX_MKDIR_ACL_ATTR */
	zvol_replay_err,	/* TX_WRITE2 */
	zvol_replay_err,	/* TX_SETSAXATTR */
	zvol_replay_err,	/* TX_RENAME_EXCHANGE */
	zvol_replay_err,	/* TX_RENAME_WHITEOUT */
	zvol_replay_clone_range,	/* TX_CLONE_RANGE */
};

/*
 * zvol_log_write() handles TX_WRITE transactions.
 */
void
zvol_log_write(zvol_state_t *zv, dmu_tx_t *tx, uint64_t offset,
    uint64_t size, boolean_t commit)
{
	uint32_t blocksize = zv->zv_volblocksize;
	zilog_t *zilog = zv->zv_zilog;
	itx_wr_state_t write_state;
	uint64_t log_size = 0;

	if (zil_replaying(zilog, tx))
		return;

	write_state = zil_write_state(zilog, size, blocksize, B_FALSE, commit);

	while (size) {
		itx_t *itx;
		lr_write_t *lr;
		itx_wr_state_t wr_state = write_state;
		ssize_t len = size;

		if (wr_state == WR_COPIED && size > zil_max_copied_data(zilog))
			wr_state = WR_NEED_COPY;
		else if (wr_state == WR_INDIRECT)
			len = MIN(blocksize - P2PHASE(offset, blocksize), size);

		itx = zil_itx_create(TX_WRITE, sizeof (*lr) +
		    (wr_state == WR_COPIED ? len : 0));
		lr = (lr_write_t *)&itx->itx_lr;
		if (wr_state == WR_COPIED &&
		    dmu_read_by_dnode(zv->zv_dn, offset, len, lr + 1,
		    DMU_READ_NO_PREFETCH | DMU_KEEP_CACHING) != 0) {
			zil_itx_destroy(itx, 0);
			itx = zil_itx_create(TX_WRITE, sizeof (*lr));
			lr = (lr_write_t *)&itx->itx_lr;
			wr_state = WR_NEED_COPY;
		}

		log_size += itx->itx_size;
		if (wr_state == WR_NEED_COPY)
			log_size += len;

		itx->itx_wr_state = wr_state;
		lr->lr_foid = ZVOL_OBJ;
		lr->lr_offset = offset;
		lr->lr_length = len;
		lr->lr_blkoff = 0;
		BP_ZERO(&lr->lr_blkptr);

		itx->itx_private = zv;

		zil_itx_assign(zilog, itx, tx);

		offset += len;
		size -= len;
	}

	dsl_pool_wrlog_count(zilog->zl_dmu_pool, log_size, tx->tx_txg);
}

/*
 * Log a DKIOCFREE/free-long-range to the ZIL with TX_TRUNCATE.
 */
void
zvol_log_truncate(zvol_state_t *zv, dmu_tx_t *tx, uint64_t off, uint64_t len)
{
	itx_t *itx;
	lr_truncate_t *lr;
	zilog_t *zilog = zv->zv_zilog;

	if (zil_replaying(zilog, tx))
		return;

	itx = zil_itx_create(TX_TRUNCATE, sizeof (*lr));
	lr = (lr_truncate_t *)&itx->itx_lr;
	lr->lr_foid = ZVOL_OBJ;
	lr->lr_offset = off;
	lr->lr_length = len;

	zil_itx_assign(zilog, itx, tx);
}
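
/*
 * Release the dbuf and range lock taken by zvol_get_data() and free the
 * zgd; used both as the dmu_sync() completion callback and on error.
 */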
static void
zvol_get_done(zgd_t *zgd, int error)
{
	(void) error;
	if (zgd->zgd_db)
		dmu_buf_rele(zgd->zgd_db, zgd);

	zfs_rangelock_exit(zgd->zgd_lr);

	kmem_free(zgd, sizeof (zgd_t));
}

/*
 * Get data to generate a TX_WRITE intent log record.
 */
int
zvol_get_data(void *arg, uint64_t arg2, lr_write_t *lr, char *buf,
    struct lwb *lwb, zio_t *zio)
{
	zvol_state_t *zv = arg;
	uint64_t offset = lr->lr_offset;
	uint64_t size = lr->lr_length;
	dmu_buf_t *db;
	zgd_t *zgd;
	int error;

	ASSERT3P(lwb, !=, NULL);
	ASSERT3U(size, !=, 0);

	zgd = kmem_zalloc(sizeof (zgd_t), KM_SLEEP);
	zgd->zgd_lwb = lwb;

	/*
	 * Write records come in two flavors: immediate and indirect.
	 * For small writes it's cheaper to store the data with the
	 * log record (immediate); for large writes it's cheaper to
	 * sync the data and get a pointer to it (indirect) so that
	 * we don't have to write the data twice.
	 */
	if (buf != NULL) { /* immediate write */
		zgd->zgd_lr = zfs_rangelock_enter(&zv->zv_rangelock, offset,
		    size, RL_READER);
		error = dmu_read_by_dnode(zv->zv_dn, offset, size, buf,
		    DMU_READ_NO_PREFETCH | DMU_KEEP_CACHING);
	} else { /* indirect write */
		ASSERT3P(zio, !=, NULL);
		/*
		 * Have to lock the whole block to ensure that no one can
		 * change the data while it is written out and its checksum
		 * is being calculated. Unlike zfs_get_data(), we need not
		 * re-check the blocksize after we take the lock because it
		 * cannot be changed.
		 */
		size = zv->zv_volblocksize;
		offset = P2ALIGN_TYPED(offset, size, uint64_t);
		zgd->zgd_lr = zfs_rangelock_enter(&zv->zv_rangelock, offset,
		    size, RL_READER);
		error = dmu_buf_hold_noread_by_dnode(zv->zv_dn, offset, zgd,
		    &db);
		if (error == 0) {
			blkptr_t *bp = &lr->lr_blkptr;

			zgd->zgd_db = db;
			zgd->zgd_bp = bp;

			ASSERT(db != NULL);
			ASSERT(db->db_offset == offset);
			ASSERT(db->db_size == size);

			error = dmu_sync(zio, lr->lr_common.lrc_txg,
			    zvol_get_done, zgd);

			if (error == 0)
				return (0);
		}
	}

	zvol_get_done(zgd, error);

	return (error);
}

/*
 * The zvol_state_t's are inserted into zvol_state_list and zvol_htable.
 */
void
zvol_insert(zvol_state_t *zv)
{
	ASSERT(RW_WRITE_HELD(&zvol_state_lock));
	list_insert_head(&zvol_state_list, zv);
	hlist_add_head(&zv->zv_hlink, ZVOL_HT_HEAD(zv->zv_hash));
}

/*
 * Simply remove the zvol from the list of zvols.
 */
static void
zvol_remove(zvol_state_t *zv)
{
	ASSERT(RW_WRITE_HELD(&zvol_state_lock));
	list_remove(&zvol_state_list, zv);
	hlist_del(&zv->zv_hlink);
}

/*
 * Set up zv after we have just taken ownership of zv->objset.
 */
static int
zvol_setup_zv(zvol_state_t *zv)
{
	uint64_t volsize;
	int error;
	uint64_t ro;
	objset_t *os = zv->zv_objset;

	ASSERT(MUTEX_HELD(&zv->zv_state_lock));
	ASSERT(RW_LOCK_HELD(&zv->zv_suspend_lock));

	zv->zv_zilog = NULL;
	zv->zv_flags &= ~ZVOL_WRITTEN_TO;

	error = dsl_prop_get_integer(zv->zv_name, "readonly", &ro, NULL);
	if (error)
		return (error);

	error = zap_lookup(os, ZVOL_ZAP_OBJ, "size", 8, 1, &volsize);
	if (error)
		return (error);

	error = dnode_hold(os, ZVOL_OBJ, zv, &zv->zv_dn);
	if (error)
		return (error);

	zvol_os_set_capacity(zv, volsize >> 9);
	zv->zv_volsize = volsize;

	if (ro || dmu_objset_is_snapshot(os) ||
	    !spa_writeable(dmu_objset_spa(os))) {
		zvol_os_set_disk_ro(zv, 1);
		zv->zv_flags |= ZVOL_RDONLY;
	} else {
		zvol_os_set_disk_ro(zv, 0);
		zv->zv_flags &= ~ZVOL_RDONLY;
	}
	return (0);
}

/*
 * Shut down everything related to zv_objset except zv_objset itself.
 * This is the reverse of zvol_setup_zv().
 */
static void
zvol_shutdown_zv(zvol_state_t *zv)
{
	ASSERT(MUTEX_HELD(&zv->zv_state_lock) &&
	    RW_LOCK_HELD(&zv->zv_suspend_lock));

	if (zv->zv_flags & ZVOL_WRITTEN_TO) {
		ASSERT(zv->zv_zilog != NULL);
		zil_close(zv->zv_zilog);
	}

	zv->zv_zilog = NULL;

	dnode_rele(zv->zv_dn, zv);
	zv->zv_dn = NULL;

	/*
	 * Evict cached data. We must write out any dirty data before
	 * disowning the dataset.
	 */
	if (zv->zv_flags & ZVOL_WRITTEN_TO)
		txg_wait_synced(dmu_objset_pool(zv->zv_objset), 0);
	dmu_objset_evict_dbufs(zv->zv_objset);
}

/*
 * return the proper tag for rollback and recv
 */
void *
zvol_tag(zvol_state_t *zv)
{
	ASSERT(RW_WRITE_HELD(&zv->zv_suspend_lock));
	return (zv->zv_open_count > 0 ? zv : NULL);
}

/*
 * Suspend the zvol for recv and rollback.
 */
zvol_state_t *
zvol_suspend(const char *name)
{
	zvol_state_t *zv;

	zv = zvol_find_by_name(name, RW_WRITER);

	if (zv == NULL)
		return (NULL);

	/* block all I/O, release in zvol_resume. */
	ASSERT(MUTEX_HELD(&zv->zv_state_lock));
	ASSERT(RW_WRITE_HELD(&zv->zv_suspend_lock));

	atomic_inc(&zv->zv_suspend_ref);

	if (zv->zv_open_count > 0)
		zvol_shutdown_zv(zv);

	/*
	 * do not hold zv_state_lock across suspend/resume to
	 * avoid locking up zvol lookups
	 */
	mutex_exit(&zv->zv_state_lock);

	/* zv_suspend_lock is released in zvol_resume() */
	return (zv);
}
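
/*
 * Resume a zvol previously suspended by zvol_suspend(): re-verify ownership
 * of the objset, rebuild the zv state if the volume is still open, and
 * release zv_suspend_lock.
 */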
int
zvol_resume(zvol_state_t *zv)
{
	int error = 0;

	ASSERT(RW_WRITE_HELD(&zv->zv_suspend_lock));

	mutex_enter(&zv->zv_state_lock);

	if (zv->zv_open_count > 0) {
		VERIFY0(dmu_objset_hold(zv->zv_name, zv, &zv->zv_objset));
		VERIFY3P(zv->zv_objset->os_dsl_dataset->ds_owner, ==, zv);
		VERIFY(dsl_dataset_long_held(zv->zv_objset->os_dsl_dataset));
		dmu_objset_rele(zv->zv_objset, zv);

		error = zvol_setup_zv(zv);
	}

	mutex_exit(&zv->zv_state_lock);

	rw_exit(&zv->zv_suspend_lock);
	/*
	 * We need this because we don't hold zvol_state_lock while releasing
	 * zv_suspend_lock. zvol_remove_minors_impl() thus cannot check
	 * zv_suspend_lock to determine whether it is safe to free, because
	 * the rwlock is not inherently atomic.
	 */
	atomic_dec(&zv->zv_suspend_ref);

	if (zv->zv_flags & ZVOL_REMOVING)
		cv_broadcast(&zv->zv_removing_cv);

	return (error);
}
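
/*
 * Take ownership of the objset and set up the zvol state on first open.
 * Called with zv_suspend_lock (reader) and zv_state_lock held, under
 * spa_namespace_lock.
 */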
int
zvol_first_open(zvol_state_t *zv, boolean_t readonly)
{
	objset_t *os;
	int error;

	ASSERT(RW_READ_HELD(&zv->zv_suspend_lock));
	ASSERT(MUTEX_HELD(&zv->zv_state_lock));
	ASSERT(mutex_owned(&spa_namespace_lock));

	boolean_t ro = (readonly || (strchr(zv->zv_name, '@') != NULL));
	error = dmu_objset_own(zv->zv_name, DMU_OST_ZVOL, ro, B_TRUE, zv, &os);
	if (error)
		return (error);

	zv->zv_objset = os;

	error = zvol_setup_zv(zv);
	if (error) {
		dmu_objset_disown(os, 1, zv);
		zv->zv_objset = NULL;
	}

	return (error);
}
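
/*
 * Tear down the zvol state and disown the objset on last close.
 */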
void
zvol_last_close(zvol_state_t *zv)
{
	ASSERT(RW_READ_HELD(&zv->zv_suspend_lock));
	ASSERT(MUTEX_HELD(&zv->zv_state_lock));

	if (zv->zv_flags & ZVOL_REMOVING)
		cv_broadcast(&zv->zv_removing_cv);

	zvol_shutdown_zv(zv);

	dmu_objset_disown(zv->zv_objset, 1, zv);
	zv->zv_objset = NULL;
}
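
/*
 * A queued unit of minor-creation work: jobs are collected on a list and
 * their zvol dnodes are prefetched asynchronously via system_taskq.
 */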
typedef struct minors_job {
	list_t *list;
	list_node_t link;
	/* input */
	char *name;
	/* output */
	int error;
} minors_job_t;

/*
 * Prefetch zvol dnodes for the minors_job
 */
static void
zvol_prefetch_minors_impl(void *arg)
{
	minors_job_t *job = arg;
	char *dsname = job->name;
	objset_t *os = NULL;

	job->error = dmu_objset_own(dsname, DMU_OST_ZVOL, B_TRUE, B_TRUE,
	    FTAG, &os);
	if (job->error == 0) {
		dmu_prefetch_dnode(os, ZVOL_OBJ, ZIO_PRIORITY_SYNC_READ);
		dmu_objset_disown(os, B_TRUE, FTAG);
	}
}

/*
 * Mask errors to continue dmu_objset_find() traversal
 */
static int
zvol_create_snap_minor_cb(const char *dsname, void *arg)
{
	minors_job_t *j = arg;
	list_t *minors_list = j->list;
	const char *name = j->name;

	ASSERT0(MUTEX_HELD(&spa_namespace_lock));

	/* skip the designated dataset */
	if (name && strcmp(dsname, name) == 0)
		return (0);

	/* at this point, the dsname should name a snapshot */
	if (strchr(dsname, '@') == 0) {
		dprintf("zvol_create_snap_minor_cb(): "
		    "%s is not a snapshot name\n", dsname);
	} else {
		minors_job_t *job;
		char *n = kmem_strdup(dsname);
		if (n == NULL)
			return (0);

		job = kmem_alloc(sizeof (minors_job_t), KM_SLEEP);
		job->name = n;
		job->list = minors_list;
		job->error = 0;
		list_insert_tail(minors_list, job);
		/* don't care if dispatch fails, because job->error is 0 */
		taskq_dispatch(system_taskq, zvol_prefetch_minors_impl, job,
		    TQ_SLEEP);
	}

	return (0);
}

/*
 * If spa_keystore_load_wkey() is called for an encrypted zvol,
 * we need to look for any clones also using the key. This function
 * is "best effort" - so we just skip over it if there are failures.
 */
static void
zvol_add_clones(const char *dsname, list_t *minors_list)
{
	/* Also check if it has clones */
	dsl_dir_t *dd = NULL;
	dsl_pool_t *dp = NULL;

	if (dsl_pool_hold(dsname, FTAG, &dp) != 0)
		return;

	if (!spa_feature_is_enabled(dp->dp_spa,
	    SPA_FEATURE_ENCRYPTION))
		goto out;

	if (dsl_dir_hold(dp, dsname, FTAG, &dd, NULL) != 0)
		goto out;

	if (dsl_dir_phys(dd)->dd_clones == 0)
		goto out;

	zap_cursor_t *zc = kmem_alloc(sizeof (zap_cursor_t), KM_SLEEP);
	zap_attribute_t *za = zap_attribute_alloc();
	objset_t *mos = dd->dd_pool->dp_meta_objset;

	for (zap_cursor_init(zc, mos, dsl_dir_phys(dd)->dd_clones);
	    zap_cursor_retrieve(zc, za) == 0;
	    zap_cursor_advance(zc)) {
		dsl_dataset_t *clone;
		minors_job_t *job;

		if (dsl_dataset_hold_obj(dd->dd_pool,
		    za->za_first_integer, FTAG, &clone) == 0) {

			char name[ZFS_MAX_DATASET_NAME_LEN];
			dsl_dataset_name(clone, name);

			char *n = kmem_strdup(name);
			job = kmem_alloc(sizeof (minors_job_t), KM_SLEEP);
			job->name = n;
			job->list = minors_list;
			job->error = 0;
			list_insert_tail(minors_list, job);

			dsl_dataset_rele(clone, FTAG);
		}
	}
	zap_cursor_fini(zc);
	zap_attribute_free(za);
	kmem_free(zc, sizeof (zap_cursor_t));

out:
	if (dd != NULL)
		dsl_dir_rele(dd, FTAG);
	dsl_pool_rele(dp, FTAG);
}

/*
 * Mask errors to continue dmu_objset_find() traversal
 */
static int
zvol_create_minors_cb(const char *dsname, void *arg)
{
	uint64_t snapdev;
	int error;
	list_t *minors_list = arg;

	ASSERT0(MUTEX_HELD(&spa_namespace_lock));

	error = dsl_prop_get_integer(dsname, "snapdev", &snapdev, NULL);
	if (error)
		return (0);

	/*
	 * Given the name and the 'snapdev' property, create device minor nodes
	 * with the linkages to zvols/snapshots as needed.
	 * If the name represents a zvol, create a minor node for the zvol, then
	 * check if its snapshots are 'visible', and if so, iterate over the
	 * snapshots and create device minor nodes for those.
	 */
	if (strchr(dsname, '@') == 0) {
		minors_job_t *job;
		char *n = kmem_strdup(dsname);
		if (n == NULL)
			return (0);

		job = kmem_alloc(sizeof (minors_job_t), KM_SLEEP);
		job->name = n;
		job->list = minors_list;
		job->error = 0;
		list_insert_tail(minors_list, job);
		/* don't care if dispatch fails, because job->error is 0 */
		taskq_dispatch(system_taskq, zvol_prefetch_minors_impl, job,
		    TQ_SLEEP);

		zvol_add_clones(dsname, minors_list);

		if (snapdev == ZFS_SNAPDEV_VISIBLE) {
			/*
			 * traverse snapshots only, do not traverse children,
			 * and skip the 'dsname'
			 */
			(void) dmu_objset_find(dsname,
			    zvol_create_snap_minor_cb, (void *)job,
			    DS_FIND_SNAPSHOTS);
		}
	} else {
		dprintf("zvol_create_minors_cb(): %s is not a zvol name\n",
		    dsname);
	}

	return (0);
}
|
|
|
|
|
|
2025-08-06 17:10:52 +03:00
|
|
|
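/*
 * Accumulate per-operation progress in the task: add the newly attempted
 * and completed counts, and if anything failed to complete, mark the task
 * status and remember the most recent error (if one was reported).
 */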
static void
zvol_task_update_status(zvol_task_t *task, uint64_t total, uint64_t done,
    int error)
{
	task->zt_total += total;
	task->zt_done += done;
	if (task->zt_total != task->zt_done) {
		task->zt_status = -1;
		if (error)
			task->zt_error = error;
	}
}

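/*
 * Log a debug message when an asynchronous minors task did not fully
 * succeed. msg[] is indexed by the task's zt_op, clamped to the trailing
 * "unknown" entry for out-of-range values.
 */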
static void
zvol_task_report_status(zvol_task_t *task)
{
#ifdef ZFS_DEBUG
	static const char *const msg[] = {
		"create",
		"remove",
		"rename",
		"set snapdev",
		"set volmode",
		"unknown",
	};

	if (task->zt_status == 0)
		return;

	zvol_async_op_t op = MIN(task->zt_op, ZVOL_ASYNC_MAX);
	if (task->zt_error) {
		dprintf("The %s minors zvol task was not ok, last error %d\n",
		    msg[op], task->zt_error);
	} else {
		dprintf("The %s minors zvol task was not ok\n", msg[op]);
	}
#else
	(void) task;
#endif
}

/*
 * Create minors for the specified dataset, including children and snapshots.
 * Pay attention to the 'snapdev' property and iterate over the snapshots
 * only if they are 'visible'. This approach ensures that the snapshot
 * metadata is read from disk only if it is needed.
 *
 * The name can represent a dataset to be recursively scanned for zvols and
 * their snapshots, or a single zvol snapshot. If the name represents a
 * dataset, the scan is performed in two nested stages:
 * - scan the dataset for zvols, and
 * - for each zvol, create a minor node, then check if the zvol's snapshots
 *   are 'visible', and only then iterate over the snapshots if needed
 *
 * If the name represents a snapshot, a check is performed to see whether the
 * snapshot is 'visible' (which also verifies that the parent is a zvol), and
 * if so, a minor node for that snapshot is created.
 */
static void
zvol_create_minors_impl(zvol_task_t *task)
{
	const char *name = task->zt_name1;
	list_t minors_list;
	minors_job_t *job;
	uint64_t snapdev;
	int total = 0, done = 0, last_error = 0, error;

	/*
	 * Note: the dsl_pool_config_lock must not be held.
	 * Minor node creation needs to obtain the zvol_state_lock.
	 * zvol_open() obtains the zvol_state_lock and then the dsl pool
	 * config lock. Therefore, we can't have the config lock now if
	 * we are going to wait for the zvol_state_lock, because it
	 * would be a lock order inversion which could lead to deadlock.
	 */

	if (zvol_inhibit_dev) {
		return;
	}

	/*
	 * This is the list for prefetch jobs. Whenever we find a match
	 * during dmu_objset_find, we insert a minors_job into the list and
	 * dispatch a taskq job so the zvol dnodes are prefetched in parallel.
	 * Note that we don't need any lock because all list operations are
	 * done on the current thread.
	 *
	 * We will use this list to do zvol_os_create_minor after prefetch
	 * so we don't have to traverse using dmu_objset_find again.
	 */
	list_create(&minors_list, sizeof (minors_job_t),
	    offsetof(minors_job_t, link));

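	/*
	 * A snapshot name is handled directly: create its minor only if the
	 * 'snapdev' property makes it visible. A dataset name is scanned
	 * recursively via zvol_create_minors_cb(), which queues prefetch
	 * jobs on minors_list for the second stage below.
	 */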
	if (strchr(name, '@') != NULL) {
		error = dsl_prop_get_integer(name, "snapdev", &snapdev, NULL);
		if (error == 0 && snapdev == ZFS_SNAPDEV_VISIBLE) {
			error = zvol_os_create_minor(name);
			if (error == 0) {
				done++;
			} else {
				last_error = error;
			}
			total++;
		}
	} else {
		fstrans_cookie_t cookie = spl_fstrans_mark();
		(void) dmu_objset_find(name, zvol_create_minors_cb,
		    &minors_list, DS_FIND_CHILDREN);
		spl_fstrans_unmark(cookie);
	}

	taskq_wait_outstanding(system_taskq, 0);

	/*
	 * Prefetch is completed, we can do zvol_os_create_minor
	 * sequentially.
	 */
	while ((job = list_remove_head(&minors_list)) != NULL) {
		if (!job->error) {
			error = zvol_os_create_minor(job->name);
			if (error == 0) {
				done++;
			} else {
				last_error = error;
			}
		} else if (job->error == EINVAL) {
			/*
			 * The objset with the name requested by the current
			 * job exists, but it is not a zvol. Just ignore this
			 * kind of error.
			 */
			done++;
		} else {
			last_error = job->error;
		}
		total++;
		kmem_strfree(job->name);
		kmem_free(job, sizeof (minors_job_t));
	}

	list_destroy(&minors_list);
	zvol_task_update_status(task, total, done, last_error);
}

/*
 * Remove minors for specified dataset including children and snapshots.
 */

/*
 * Remove the minor for a given zvol. This will do it all:
 *  - flag the zvol for removal, so new requests are rejected
 *  - wait until outstanding requests are completed
 *  - remove it from lists
 *  - free it
 * It's also usable as a taskq task, and smells nice too.
 */
static void
zvol_remove_minor_task(void *arg)
{
	zvol_state_t *zv = (zvol_state_t *)arg;

	ASSERT(!RW_LOCK_HELD(&zvol_state_lock));
	ASSERT(!MUTEX_HELD(&zv->zv_state_lock));

	mutex_enter(&zv->zv_state_lock);
	while (zv->zv_open_count > 0 || atomic_read(&zv->zv_suspend_ref)) {
		zv->zv_flags |= ZVOL_REMOVING;
		cv_wait(&zv->zv_removing_cv, &zv->zv_state_lock);
	}
	mutex_exit(&zv->zv_state_lock);

	rw_enter(&zvol_state_lock, RW_WRITER);
	mutex_enter(&zv->zv_state_lock);

	zvol_os_remove_minor(zv);
	zvol_remove(zv);

	mutex_exit(&zv->zv_state_lock);
	rw_exit(&zvol_state_lock);

	zvol_os_free(zv);
}

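/* Taskq callback that just frees a zvol_state_t via zvol_os_free(). */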
static void
zvol_free_task(void *arg)
{
	zvol_os_free(arg);
}

/*
 * Remove minors for specified dataset and, optionally, its children and
 * snapshots.
 */
static void
zvol_remove_minors_impl(zvol_task_t *task)
{
	zvol_state_t *zv, *zv_next;
	const char *name = task ? task->zt_name1 : NULL;
	int namelen = ((name) ? strlen(name) : 0);
	boolean_t children = task ? !!task->zt_value : B_TRUE;
	taskqid_t t;
	list_t delay_list, free_list;

	if (zvol_inhibit_dev)
		return;

	list_create(&delay_list, sizeof (zvol_state_t),
	    offsetof(zvol_state_t, zv_next));
	list_create(&free_list, sizeof (zvol_state_t),
	    offsetof(zvol_state_t, zv_next));

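	/*
	 * 'error' stays ENOENT unless at least one zvol matched the request;
	 * it is copied into the task at the end so callers can tell that
	 * nothing was found.
	 */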
	int error = ENOENT;

	rw_enter(&zvol_state_lock, RW_WRITER);

	for (zv = list_head(&zvol_state_list); zv != NULL; zv = zv_next) {
		zv_next = list_next(&zvol_state_list, zv);

		mutex_enter(&zv->zv_state_lock);

		/*
		 * This zvol should be removed if:
		 * - no name was offered (ie removing all at shutdown); or
		 * - name matches exactly; or
		 * - we were asked to remove children, and
		 *   - the start of the name matches, and
		 *   - there is a '/' immediately after the matched name; or
		 *   - there is a '@' immediately after the matched name
		 */
		if (name == NULL || strcmp(zv->zv_name, name) == 0 ||
		    (children && strncmp(zv->zv_name, name, namelen) == 0 &&
		    (zv->zv_name[namelen] == '/' ||
		    zv->zv_name[namelen] == '@'))) {
			/*
			 * By holding zv_state_lock here, we guarantee that no
			 * one is currently using this zv
			 */
			error = 0;

			/*
			 * If in use, try to throw everyone off and try again
			 * later.
			 */
			zv->zv_flags |= ZVOL_REMOVING;
			if (zv->zv_open_count > 0 ||
			    atomic_read(&zv->zv_suspend_ref)) {
				t = taskq_dispatch(
				    zv->zv_objset->os_spa->spa_zvol_taskq,
				    zvol_remove_minor_task, zv, TQ_SLEEP);
				if (t == TASKQID_INVALID) {
					/*
					 * Couldn't create the task, so we'll
					 * do it in place once the loop is
					 * finished.
					 */
					list_insert_head(&delay_list, zv);
				}
				mutex_exit(&zv->zv_state_lock);
				continue;
			}

			zvol_os_remove_minor(zv);
			zvol_remove(zv);

			/* Drop zv_state_lock before zvol_free() */
			mutex_exit(&zv->zv_state_lock);

			/* Try parallel zv_free, if failed do it in place */
			t = taskq_dispatch(system_taskq, zvol_free_task, zv,
			    TQ_SLEEP);
			if (t == TASKQID_INVALID)
				list_insert_head(&free_list, zv);
		} else {
			mutex_exit(&zv->zv_state_lock);
		}
	}
	rw_exit(&zvol_state_lock);

	/* Wait for zvols that we couldn't create a remove task for */
	while ((zv = list_remove_head(&delay_list)) != NULL)
		zvol_remove_minor_task(zv);

	/* Free any that we couldn't free in parallel earlier */
	while ((zv = list_remove_head(&free_list)) != NULL)
		zvol_os_free(zv);

	if (error && task)
		task->zt_error = SET_ERROR(error);
}

/* Remove minor for this specific volume only */
static int
zvol_remove_minor_impl(const char *name)
{
	if (zvol_inhibit_dev)
		return (0);

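	/*
	 * Build a temporary task on the stack; zt_value is left B_FALSE so
	 * zvol_remove_minors_impl() matches only this exact name and does
	 * not touch children or snapshots.
	 */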
	zvol_task_t task;
	memset(&task, 0, sizeof (zvol_task_t));
	strlcpy(task.zt_name1, name, sizeof (task.zt_name1));
	task.zt_value = B_FALSE;

	zvol_remove_minors_impl(&task);

	return (task.zt_error);
}

/*
 * Rename minors for specified dataset including children and snapshots.
 */
static void
zvol_rename_minors_impl(zvol_task_t *task)
{
	zvol_state_t *zv, *zv_next;
	const char *oldname = task->zt_name1;
	const char *newname = task->zt_name2;
	int total = 0, done = 0, last_error = 0, error = 0, oldnamelen;

	if (zvol_inhibit_dev)
		return;

	oldnamelen = strlen(oldname);

	rw_enter(&zvol_state_lock, RW_READER);

	for (zv = list_head(&zvol_state_list); zv != NULL; zv = zv_next) {
		zv_next = list_next(&zvol_state_list, zv);

		mutex_enter(&zv->zv_state_lock);

		if (strcmp(zv->zv_name, oldname) == 0) {
			error = zvol_os_rename_minor(zv, newname);
		} else if (strncmp(zv->zv_name, oldname, oldnamelen) == 0 &&
		    (zv->zv_name[oldnamelen] == '/' ||
		    zv->zv_name[oldnamelen] == '@')) {
			char *name = kmem_asprintf("%s%c%s", newname,
			    zv->zv_name[oldnamelen],
			    zv->zv_name + oldnamelen + 1);
			error = zvol_os_rename_minor(zv, name);
			kmem_strfree(name);
		}
		if (error) {
			last_error = error;
		} else {
			done++;
		}
		total++;
		mutex_exit(&zv->zv_state_lock);
	}

	rw_exit(&zvol_state_lock);
	zvol_task_update_status(task, total, done, last_error);
}

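/*
 * Argument block for zvol_set_snapdev_cb(): the task whose status is being
 * accumulated and the new 'snapdev' value to apply.
 */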
typedef struct zvol_snapdev_cb_arg {
	zvol_task_t *task;
	uint64_t snapdev;
} zvol_snapdev_cb_arg_t;

static int
zvol_set_snapdev_cb(const char *dsname, void *param)
{
	zvol_snapdev_cb_arg_t *arg = param;
	int error = 0;

	if (strchr(dsname, '@') == NULL)
		return (0);

	switch (arg->snapdev) {
	case ZFS_SNAPDEV_VISIBLE:
		error = zvol_os_create_minor(dsname);
		break;
	case ZFS_SNAPDEV_HIDDEN:
		error = zvol_remove_minor_impl(dsname);
		break;
	}

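	/* One snapshot was processed; count it as done only on success. */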
	zvol_task_update_status(arg->task, 1, error == 0, error);
	return (0);
}

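/*
 * Apply a new 'snapdev' value: scan the snapshots of the named dataset and
 * create or remove their minors accordingly. The dataset hierarchy itself is
 * handled by zvol_set_snapdev_sync(); only snapshots are visited here.
 */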
static void
zvol_set_snapdev_impl(zvol_task_t *task)
{
	const char *name = task->zt_name1;
	uint64_t snapdev = task->zt_value;

	zvol_snapdev_cb_arg_t arg = {task, snapdev};
	fstrans_cookie_t cookie = spl_fstrans_mark();
	/*
	 * The zvol_set_snapdev_sync() sets snapdev appropriately
	 * in the dataset hierarchy. Here, we only scan snapshots.
	 */
	dmu_objset_find(name, zvol_set_snapdev_cb, &arg, DS_FIND_SNAPSHOTS);
	spl_fstrans_unmark(cookie);
}

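/*
 * Apply a new 'volmode' value to the named volume by removing its minor and,
 * unless the new mode is 'none', creating it again with the new settings.
 */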
static void
zvol_set_volmode_impl(zvol_task_t *task)
{
	const char *name = task->zt_name1;
	uint64_t volmode = task->zt_value;
	fstrans_cookie_t cookie;
	uint64_t old_volmode;
	zvol_state_t *zv;
	int error;

	if (strchr(name, '@') != NULL)
		return;

	/*
	 * It's unfortunate we need to remove minors before we create new ones:
	 * this is necessary because our backing gendisk (zvol_state->zv_disk)
	 * could be different when we set, for instance, volmode from "geom"
	 * to "dev" (or vice versa).
	 */
	zv = zvol_find_by_name(name, RW_NONE);
	if (zv == NULL && volmode == ZFS_VOLMODE_NONE)
		return;
	if (zv != NULL) {
		old_volmode = zv->zv_volmode;
		mutex_exit(&zv->zv_state_lock);
		if (old_volmode == volmode)
			return;
		zvol_wait_close(zv);
	}
	cookie = spl_fstrans_mark();
	switch (volmode) {
	case ZFS_VOLMODE_NONE:
		error = zvol_remove_minor_impl(name);
		break;
	case ZFS_VOLMODE_GEOM:
	case ZFS_VOLMODE_DEV:
		error = zvol_remove_minor_impl(name);
		/*
		 * The remove call above may not have been needed if volmode
		 * was previously 'none'; ignore ENOENT in that case.
		 */
		if (error == ENOENT)
			error = 0;
		else if (error)
			break;
		error = zvol_os_create_minor(name);
		break;
	case ZFS_VOLMODE_DEFAULT:
		error = zvol_remove_minor_impl(name);
		if (zvol_volmode == ZFS_VOLMODE_NONE)
			break;
		else /* if zvol_volmode is invalid defaults to "geom" */
			error = zvol_os_create_minor(name);
		break;
	}
	zvol_task_update_status(task, 1, error == 0, error);
	spl_fstrans_unmark(cookie);
}

/*
|
|
|
|
|
* The worker thread function performed asynchronously.
|
|
|
|
|
*/
|
|
|
|
|
static void
zvol_task_cb(void *arg)
{
	zvol_task_t *task = arg;

	switch (task->zt_op) {
	case ZVOL_ASYNC_CREATE_MINORS:
		zvol_create_minors_impl(task);
		break;
	case ZVOL_ASYNC_REMOVE_MINORS:
		zvol_remove_minors_impl(task);
		break;
	case ZVOL_ASYNC_RENAME_MINORS:
		zvol_rename_minors_impl(task);
		break;
	case ZVOL_ASYNC_SET_SNAPDEV:
		zvol_set_snapdev_impl(task);
		break;
	case ZVOL_ASYNC_SET_VOLMODE:
		zvol_set_volmode_impl(task);
		break;
	default:
		VERIFY(0);
		break;
	}

	zvol_task_report_status(task);
	kmem_free(task, sizeof (zvol_task_t));
}
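
/*
 * Argument block for zvol_set_common(): apply an integer zvol property
 * (snapdev or volmode) across a dataset hierarchy via a DSL sync task.
 */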
typedef struct zvol_set_prop_int_arg {
	const char *zsda_name;
	uint64_t zsda_value;
	zprop_source_t zsda_source;
	zfs_prop_t zsda_prop;
} zvol_set_prop_int_arg_t;

/*
 * Sanity check the dataset for safe use by the sync task.  No additional
 * conditions are imposed.
 */
static int
zvol_set_common_check(void *arg, dmu_tx_t *tx)
{
	zvol_set_prop_int_arg_t *zsda = arg;
	dsl_pool_t *dp = dmu_tx_pool(tx);
	dsl_dir_t *dd;
	int error;

	error = dsl_dir_hold(dp, zsda->zsda_name, FTAG, &dd, NULL);
	if (error != 0)
		return (error);

	dsl_dir_rele(dd, FTAG);

	return (error);
}
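
/*
 * Callback invoked for each child dataset by zvol_set_common_sync(): read
 * the effective snapdev/volmode value for the dataset and dispatch an
 * asynchronous task to apply it to the dataset's zvol minors.
 */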
static int
zvol_set_common_sync_cb(dsl_pool_t *dp, dsl_dataset_t *ds, void *arg)
{
	zvol_set_prop_int_arg_t *zsda = arg;
	char dsname[ZFS_MAX_DATASET_NAME_LEN];
	zvol_task_t *task;
	uint64_t prop;

	const char *prop_name = zfs_prop_to_name(zsda->zsda_prop);
	dsl_dataset_name(ds, dsname);

	if (dsl_prop_get_int_ds(ds, prop_name, &prop) != 0)
		return (0);

	task = kmem_zalloc(sizeof (zvol_task_t), KM_SLEEP);
	if (zsda->zsda_prop == ZFS_PROP_VOLMODE) {
		task->zt_op = ZVOL_ASYNC_SET_VOLMODE;
	} else if (zsda->zsda_prop == ZFS_PROP_SNAPDEV) {
		task->zt_op = ZVOL_ASYNC_SET_SNAPDEV;
	} else {
		kmem_free(task, sizeof (zvol_task_t));
		return (0);
	}
	task->zt_value = prop;
	strlcpy(task->zt_name1, dsname, sizeof (task->zt_name1));
	(void) taskq_dispatch(dp->dp_spa->spa_zvol_taskq, zvol_task_cb,
	    task, TQ_SLEEP);
	return (0);
}

/*
 * Traverse all child datasets and apply the property appropriately.
 * We call dsl_prop_set_sync_impl() here to set the value only on the toplevel
 * dataset and read the effective "property" on every child in the callback
 * function: this is because the value is not guaranteed to be the same in the
 * whole dataset hierarchy.
 */
static void
zvol_set_common_sync(void *arg, dmu_tx_t *tx)
{
	zvol_set_prop_int_arg_t *zsda = arg;
	dsl_pool_t *dp = dmu_tx_pool(tx);
	dsl_dir_t *dd;
	dsl_dataset_t *ds;
	int error;

	VERIFY0(dsl_dir_hold(dp, zsda->zsda_name, FTAG, &dd, NULL));

	error = dsl_dataset_hold(dp, zsda->zsda_name, FTAG, &ds);
	if (error == 0) {
		dsl_prop_set_sync_impl(ds, zfs_prop_to_name(zsda->zsda_prop),
		    zsda->zsda_source, sizeof (zsda->zsda_value), 1,
		    &zsda->zsda_value, tx);
		dsl_dataset_rele(ds, FTAG);
	}

	dmu_objset_find_dp(dp, dd->dd_object, zvol_set_common_sync_cb,
	    zsda, DS_FIND_CHILDREN);

	dsl_dir_rele(dd, FTAG);
}
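
/*
 * Set an integer property (snapdev or volmode) on the named dataset and
 * propagate the effective value to the zvol minors of all children.
 */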
int
zvol_set_common(const char *ddname, zfs_prop_t prop, zprop_source_t source,
    uint64_t val)
{
	zvol_set_prop_int_arg_t zsda;

	zsda.zsda_name = ddname;
	zsda.zsda_source = source;
	zsda.zsda_value = val;
	zsda.zsda_prop = prop;

	return (dsl_sync_task(ddname, zvol_set_common_check,
	    zvol_set_common_sync, &zsda, 0, ZFS_SPACE_CHECK_NONE));
}
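
/*
 * Create zvol minors for the given pool or dataset name.  The work runs on
 * the spa zvol taskq, but we wait for it here so the minors exist by the
 * time this function returns.
 */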
void
zvol_create_minors(const char *name)
{
	spa_t *spa;
	zvol_task_t *task;
	taskqid_t id;

	if (spa_open(name, &spa, FTAG) != 0)
		return;

	task = kmem_zalloc(sizeof (zvol_task_t), KM_SLEEP);
	task->zt_op = ZVOL_ASYNC_CREATE_MINORS;
	strlcpy(task->zt_name1, name, sizeof (task->zt_name1));
	id = taskq_dispatch(spa->spa_zvol_taskq, zvol_task_cb, task, TQ_SLEEP);
	if (id != TASKQID_INVALID)
		taskq_wait_id(spa->spa_zvol_taskq, id);

	spa_close(spa, FTAG);
}
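
/*
 * Remove zvol minors below the given name.  The task runs on the spa zvol
 * taskq; when async is B_FALSE we wait for it to complete.
 */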
void
zvol_remove_minors(spa_t *spa, const char *name, boolean_t async)
{
	zvol_task_t *task;
	taskqid_t id;

	task = kmem_zalloc(sizeof (zvol_task_t), KM_SLEEP);
	task->zt_op = ZVOL_ASYNC_REMOVE_MINORS;
	strlcpy(task->zt_name1, name, sizeof (task->zt_name1));
	task->zt_value = B_TRUE;
	id = taskq_dispatch(spa->spa_zvol_taskq, zvol_task_cb, task, TQ_SLEEP);
	if ((async == B_FALSE) && (id != TASKQID_INVALID))
		taskq_wait_id(spa->spa_zvol_taskq, id);
}
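
/*
 * Rename zvol minors from name1 to name2.  The task runs on the spa zvol
 * taskq; when async is B_FALSE we wait for it to complete.
 */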
void
zvol_rename_minors(spa_t *spa, const char *name1, const char *name2,
    boolean_t async)
{
	zvol_task_t *task;
	taskqid_t id;

	task = kmem_zalloc(sizeof (zvol_task_t), KM_SLEEP);
	task->zt_op = ZVOL_ASYNC_RENAME_MINORS;
	strlcpy(task->zt_name1, name1, sizeof (task->zt_name1));
	strlcpy(task->zt_name2, name2, sizeof (task->zt_name2));
	id = taskq_dispatch(spa->spa_zvol_taskq, zvol_task_cb, task, TQ_SLEEP);
	if ((async == B_FALSE) && (id != TASKQID_INVALID))
		taskq_wait_id(spa->spa_zvol_taskq, id);
}
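
/*
 * Return B_TRUE if the given name refers to a zvol; the actual check is
 * platform-specific (zvol_os_is_zvol()).
 */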
boolean_t
zvol_is_zvol(const char *name)
{
	return (zvol_os_is_zvol(name));
}
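
/*
 * Platform-independent zvol initialization: size and create the zvol
 * taskqs and set up the global state list, lock, and hash table.
 */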
int
zvol_init_impl(void)
{
	int i;

	/*
	 * zvol_threads is the module param the user passes in.
	 *
	 * zvol_actual_threads is what we use internally, since the user can
	 * pass zvol_threads = 0 to mean "use all the CPUs" (the default).
	 */
	static unsigned int zvol_actual_threads;

	if (zvol_threads == 0) {
		/*
		 * See dde9380a1 for why 32 was chosen here.  This should
		 * probably be refined to be some multiple of the number
		 * of CPUs.
		 */
		zvol_actual_threads = MAX(max_ncpus, 32);
	} else {
		zvol_actual_threads = MIN(MAX(zvol_threads, 1), 1024);
	}

	/*
	 * Use at least 32 zvol_threads, but for many-core systems prefer
	 * 6 threads per taskq, with no more taskqs than threads in them
	 * on large systems.
	 *
	 *                 taskq   total
	 * cpus    taskqs  threads threads
	 * ------- ------- ------- -------
	 * 1       1       32       32
	 * 2       1       32       32
	 * 4       1       32       32
	 * 8       2       16       32
	 * 16      3       11       33
	 * 32      5       7        35
	 * 64      8       8        64
	 * 128     11      12       132
	 * 256     16      16       256
	 */
	zv_taskq_t *ztqs = &zvol_taskqs;
	int num_tqs = MIN(max_ncpus, zvol_num_taskqs);
	if (num_tqs == 0) {
		num_tqs = 1 + max_ncpus / 6;
		while (num_tqs * num_tqs > zvol_actual_threads)
			num_tqs--;
	}

	int per_tq_thread = zvol_actual_threads / num_tqs;
	if (per_tq_thread * num_tqs < zvol_actual_threads)
		per_tq_thread++;

	ztqs->tqs_cnt = num_tqs;
	ztqs->tqs_taskq = kmem_alloc(num_tqs * sizeof (taskq_t *), KM_SLEEP);

	for (uint_t i = 0; i < num_tqs; i++) {
		char name[32];
		(void) snprintf(name, sizeof (name), "%s_tq-%u",
		    ZVOL_DRIVER, i);
		ztqs->tqs_taskq[i] = taskq_create(name, per_tq_thread,
		    maxclsyspri, per_tq_thread, INT_MAX,
		    TASKQ_PREPOPULATE | TASKQ_DYNAMIC);
		if (ztqs->tqs_taskq[i] == NULL) {
			for (int j = i - 1; j >= 0; j--)
				taskq_destroy(ztqs->tqs_taskq[j]);
			kmem_free(ztqs->tqs_taskq, ztqs->tqs_cnt *
			    sizeof (taskq_t *));
			ztqs->tqs_taskq = NULL;
			return (SET_ERROR(ENOMEM));
		}
	}

	list_create(&zvol_state_list, sizeof (zvol_state_t),
	    offsetof(zvol_state_t, zv_next));
	rw_init(&zvol_state_lock, NULL, RW_DEFAULT, NULL);

	zvol_htable = kmem_alloc(ZVOL_HT_SIZE * sizeof (struct hlist_head),
	    KM_SLEEP);
	for (i = 0; i < ZVOL_HT_SIZE; i++)
		INIT_HLIST_HEAD(&zvol_htable[i]);

	return (0);
}
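
/*
 * Platform-independent zvol teardown: remove any remaining minors, wait
 * for outstanding asynchronous removals, then free the hash table, state
 * list, lock, and taskqs created by zvol_init_impl().
 */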
void
zvol_fini_impl(void)
{
	zv_taskq_t *ztqs = &zvol_taskqs;

	zvol_remove_minors_impl(NULL);

	/*
	 * The call to "zvol_remove_minors_impl" may dispatch entries to
	 * the system_taskq, but it doesn't wait for those entries to
	 * complete before it returns.  Thus, we must wait for all of the
	 * removals to finish, before we can continue.
	 */
	taskq_wait_outstanding(system_taskq, 0);

	kmem_free(zvol_htable, ZVOL_HT_SIZE * sizeof (struct hlist_head));
	list_destroy(&zvol_state_list);
	rw_destroy(&zvol_state_lock);

	if (ztqs->tqs_taskq == NULL) {
		ASSERT0(ztqs->tqs_cnt);
	} else {
		for (uint_t i = 0; i < ztqs->tqs_cnt; i++) {
			ASSERT3P(ztqs->tqs_taskq[i], !=, NULL);
			taskq_destroy(ztqs->tqs_taskq[i]);
		}
		kmem_free(ztqs->tqs_taskq, ztqs->tqs_cnt *
		    sizeof (taskq_t *));
		ztqs->tqs_taskq = NULL;
	}
}
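
/* Tunables, exposed as module parameters on Linux and sysctls on FreeBSD. */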
ZFS_MODULE_PARAM(zfs_vol, zvol_, inhibit_dev, UINT, ZMOD_RW,
	"Do not create zvol device nodes");
ZFS_MODULE_PARAM(zfs_vol, zvol_, prefetch_bytes, UINT, ZMOD_RW,
	"Prefetch N bytes at zvol start+end");
ZFS_MODULE_PARAM(zfs_vol, zvol_vol, mode, UINT, ZMOD_RW,
	"Default volmode property value");
ZFS_MODULE_PARAM(zfs_vol, zvol_, threads, UINT, ZMOD_RW,
	"Number of threads for I/O requests. Set to 0 to use all active CPUs");
ZFS_MODULE_PARAM(zfs_vol, zvol_, num_taskqs, UINT, ZMOD_RW,
	"Number of zvol taskqs");
ZFS_MODULE_PARAM(zfs_vol, zvol_, request_sync, UINT, ZMOD_RW,
	"Synchronously handle bio requests");
|