zvol: stop using zvol_state_lock to protect OS-side private data

zvol_state_lock is intended to protect access to the global name->zvol
lists (zvol_find_by_name()), but has also been used to control access to
OS-side private data, accessed through whatever kernel object is used to
represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side
object is what's used to get a handle on the zvol_state_t, so
zv_state_lock and zv_suspend_lock can't be used to manage access. But
also, with the private object and the zvol_state_t being shut down and
destroyed at the same time in zvol_os_free(), we must ensure that the
private object pointer only ever corresponds to a real zvol_state_t,
never one in partial destruction. Taking the global lock seems like a
convenient way to ensure this.

The problem with this is that zvol_state_lock does not actually protect
access to the zvol_state_t internals, so we need to take zv_state_lock
and/or zv_suspend_lock as well. If those are contended, OS-side
operations (eg zvol_open()) can end up sleeping on them while holding
zvol_state_lock. This then blocks out all other OS-side operations that
want to get at the private data, and any ZFS-side control operations
that would take the write half of the lock. It's even worse if ZFS-side
operations induce OS-side calls back into the zvol (eg creating a zvol
triggers a partition probe inside the kernel, and also a userspace
access from udev to set up device links). And it gets worse again if
anything decides to defer those ops to a task and wait on them, which
zvol_remove_minors_impl() will do under high load.
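
To make that concrete, the old OS-side open paths looked roughly like
this (a simplified sketch in the kernel's own idiom, using the Linux
spelling of the private pointer; the tryenter/retry dance around
zv_suspend_lock is condensed, so this is not the exact code):

    rw_enter(&zvol_state_lock, RW_READER);
    zv = disk->private_data;
    if (zv == NULL) {
            rw_exit(&zvol_state_lock);
            return (-SET_ERROR(ENXIO));
    }
    /*
     * Both of these can block while zvol_state_lock is still held,
     * stalling every other opener and any minor create/remove that
     * wants the write half of zvol_state_lock.
     */
    mutex_enter(&zv->zv_state_lock);
    rw_enter(&zv->zv_suspend_lock, RW_READER);
    ...
    rw_exit(&zvol_state_lock);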

However, since the previous commit, we have a guarantee that the
private data pointer will always be NULL'd out in
zvol_os_remove_minor() _before_ the zvol_state_t is made invalid, and
that invalidation won't happen until all users have been ejected. So,
if we make access to the private object pointer atomic, we remove the
need to take the global lock to access it, and so we can remove all
acquisitions of zvol_state_lock from the OS side.
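
The access pattern then becomes (a condensed sketch of the changes
below, using the Linux names; error handling and the suspend-lock
retry are trimmed):

    /* OS-side consumers, eg zvol_open(): no global lock needed. */
    zv = atomic_load_ptr(&disk->private_data);
    if (zv == NULL)
            return (-SET_ERROR(ENXIO));     /* minor already gone */
    mutex_enter(&zv->zv_state_lock);
    if (zv->zv_flags & ZVOL_REMOVING) {
            /* Removal has begun; refuse new opens. */
            mutex_exit(&zv->zv_state_lock);
            return (-SET_ERROR(ENXIO));
    }
    ...

    /* zvol_os_remove_minor(): clear the pointer before teardown. */
    atomic_store_ptr(&disk->private_data, NULL);

If the open path has to drop zv_state_lock to take zv_suspend_lock, it
reloads the pointer and rechecks ZVOL_REMOVING afterwards, since
removal may have started while the locks were down.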

While here, I've rewritten much of the locking theory comment at the top
of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've
tried to describe the purpose of each lock in a little more detail, and
in particular describe where it should and shouldn't be used.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
Rob Norris 2025-08-05 14:19:24 +10:00 committed by Brian Behlendorf
parent 96f9d271ea
commit 8a0e5e8b54
3 changed files with 134 additions and 113 deletions

View File

@ -225,25 +225,14 @@ zvol_geom_open(struct g_provider *pp, int flag, int count)
}
retry:
rw_enter(&zvol_state_lock, ZVOL_RW_READER);
/*
* Obtain a copy of private under zvol_state_lock to make sure either
* the result of zvol free code setting private to NULL is observed,
* or the zv is protected from being freed because of the positive
* zv_open_count.
*/
zv = pp->private;
if (zv == NULL) {
rw_exit(&zvol_state_lock);
err = SET_ERROR(ENXIO);
goto out_locked;
}
zv = atomic_load_ptr(&pp->private);
if (zv == NULL)
return (SET_ERROR(ENXIO));
mutex_enter(&zv->zv_state_lock);
if (zv->zv_zso->zso_dying || zv->zv_flags & ZVOL_REMOVING) {
rw_exit(&zvol_state_lock);
err = SET_ERROR(ENXIO);
goto out_zv_locked;
goto out_locked;
}
ASSERT3S(zv->zv_volmode, ==, ZFS_VOLMODE_GEOM);
@ -256,8 +245,24 @@ retry:
drop_suspend = B_TRUE;
if (!rw_tryenter(&zv->zv_suspend_lock, ZVOL_RW_READER)) {
mutex_exit(&zv->zv_state_lock);
/*
* Removal may happen while the locks are down, so
* we can't trust zv any longer; we have to start over.
*/
zv = atomic_load_ptr(&pp->private);
if (zv == NULL)
return (SET_ERROR(ENXIO));
rw_enter(&zv->zv_suspend_lock, ZVOL_RW_READER);
mutex_enter(&zv->zv_state_lock);
if (zv->zv_zso->zso_dying ||
zv->zv_flags & ZVOL_REMOVING) {
err = SET_ERROR(ENXIO);
goto out_locked;
}
/* Check to see if zv_suspend_lock is needed. */
if (zv->zv_open_count != 0) {
rw_exit(&zv->zv_suspend_lock);
@ -265,7 +270,6 @@ retry:
}
}
}
rw_exit(&zvol_state_lock);
ASSERT(MUTEX_HELD(&zv->zv_state_lock));
@ -293,7 +297,7 @@ retry:
if (drop_namespace)
mutex_exit(&spa_namespace_lock);
if (err)
goto out_zv_locked;
goto out_locked;
pp->mediasize = zv->zv_volsize;
pp->stripeoffset = 0;
pp->stripesize = zv->zv_volblocksize;
@ -328,9 +332,8 @@ out_opened:
zvol_last_close(zv);
wakeup(zv);
}
out_zv_locked:
mutex_exit(&zv->zv_state_lock);
out_locked:
mutex_exit(&zv->zv_state_lock);
if (drop_suspend)
rw_exit(&zv->zv_suspend_lock);
return (err);
@ -344,12 +347,9 @@ zvol_geom_close(struct g_provider *pp, int flag, int count)
boolean_t drop_suspend = B_TRUE;
int new_open_count;
rw_enter(&zvol_state_lock, ZVOL_RW_READER);
zv = pp->private;
if (zv == NULL) {
rw_exit(&zvol_state_lock);
zv = atomic_load_ptr(&pp->private);
if (zv == NULL)
return (SET_ERROR(ENXIO));
}
mutex_enter(&zv->zv_state_lock);
if (zv->zv_flags & ZVOL_EXCL) {
@ -376,6 +376,15 @@ zvol_geom_close(struct g_provider *pp, int flag, int count)
mutex_exit(&zv->zv_state_lock);
rw_enter(&zv->zv_suspend_lock, ZVOL_RW_READER);
mutex_enter(&zv->zv_state_lock);
/*
* Unlike in zvol_geom_open(), we don't check if
* removal started here, because we might be one of the
* openers that needs to be thrown out! If we're the
* last, we need to call zvol_last_close() below to
* finish cleanup. So, no special treatment for us.
*/
/* Check to see if zv_suspend_lock is needed. */
new_open_count = zv->zv_open_count - count;
if (new_open_count != 0) {
@ -386,7 +395,6 @@ zvol_geom_close(struct g_provider *pp, int flag, int count)
} else {
drop_suspend = B_FALSE;
}
rw_exit(&zvol_state_lock);
ASSERT(MUTEX_HELD(&zv->zv_state_lock));
@ -439,7 +447,7 @@ zvol_geom_access(struct g_provider *pp, int acr, int acw, int ace)
("Unsupported access request to %s (acr=%d, acw=%d, ace=%d).",
pp->name, acr, acw, ace));
if (pp->private == NULL) {
if (atomic_load_ptr(&pp->private) == NULL) {
if (acr <= 0 && acw <= 0 && ace <= 0)
return (0);
return (pp->error);
@ -906,25 +914,14 @@ zvol_cdev_open(struct cdev *dev, int flags, int fmt, struct thread *td)
boolean_t drop_suspend = B_FALSE;
retry:
rw_enter(&zvol_state_lock, ZVOL_RW_READER);
/*
* Obtain a copy of si_drv2 under zvol_state_lock to make sure either
* the result of zvol free code setting si_drv2 to NULL is observed,
* or the zv is protected from being freed because of the positive
* zv_open_count.
*/
zv = dev->si_drv2;
if (zv == NULL) {
rw_exit(&zvol_state_lock);
err = SET_ERROR(ENXIO);
goto out_locked;
}
zv = atomic_load_ptr(&dev->si_drv2);
if (zv == NULL)
return (SET_ERROR(ENXIO));
mutex_enter(&zv->zv_state_lock);
if (zv->zv_zso->zso_dying) {
rw_exit(&zvol_state_lock);
if (zv->zv_zso->zso_dying || zv->zv_flags & ZVOL_REMOVING) {
err = SET_ERROR(ENXIO);
goto out_zv_locked;
goto out_locked;
}
ASSERT3S(zv->zv_volmode, ==, ZFS_VOLMODE_DEV);
@ -939,6 +936,13 @@ retry:
mutex_exit(&zv->zv_state_lock);
rw_enter(&zv->zv_suspend_lock, ZVOL_RW_READER);
mutex_enter(&zv->zv_state_lock);
if (unlikely(zv->zv_flags & ZVOL_REMOVING)) {
/* Removal started while locks were down. */
err = SET_ERROR(ENXIO);
goto out_locked;
}
/* Check to see if zv_suspend_lock is needed. */
if (zv->zv_open_count != 0) {
rw_exit(&zv->zv_suspend_lock);
@ -946,7 +950,6 @@ retry:
}
}
}
rw_exit(&zvol_state_lock);
ASSERT(MUTEX_HELD(&zv->zv_state_lock));
@ -974,7 +977,7 @@ retry:
if (drop_namespace)
mutex_exit(&spa_namespace_lock);
if (err)
goto out_zv_locked;
goto out_locked;
}
ASSERT(MUTEX_HELD(&zv->zv_state_lock));
@ -1001,9 +1004,8 @@ out_opened:
zvol_last_close(zv);
wakeup(zv);
}
out_zv_locked:
mutex_exit(&zv->zv_state_lock);
out_locked:
mutex_exit(&zv->zv_state_lock);
if (drop_suspend)
rw_exit(&zv->zv_suspend_lock);
return (err);
@ -1015,12 +1017,9 @@ zvol_cdev_close(struct cdev *dev, int flags, int fmt, struct thread *td)
zvol_state_t *zv;
boolean_t drop_suspend = B_TRUE;
rw_enter(&zvol_state_lock, ZVOL_RW_READER);
zv = dev->si_drv2;
if (zv == NULL) {
rw_exit(&zvol_state_lock);
zv = atomic_load_ptr(&dev->si_drv2);
if (zv == NULL)
return (SET_ERROR(ENXIO));
}
mutex_enter(&zv->zv_state_lock);
if (zv->zv_flags & ZVOL_EXCL) {
@ -1045,6 +1044,15 @@ zvol_cdev_close(struct cdev *dev, int flags, int fmt, struct thread *td)
mutex_exit(&zv->zv_state_lock);
rw_enter(&zv->zv_suspend_lock, ZVOL_RW_READER);
mutex_enter(&zv->zv_state_lock);
/*
* Unlike in zvol_cdev_open(), we don't check if
* removal started here, because we might be one of the
* openers that needs to be thrown out! If we're the
* last, we need to call zvol_last_close() below to
* finish cleanup. So, no special treatment for us.
*/
/* Check to see if zv_suspend_lock is needed. */
if (zv->zv_open_count != 1) {
rw_exit(&zv->zv_suspend_lock);
@ -1054,7 +1062,6 @@ zvol_cdev_close(struct cdev *dev, int flags, int fmt, struct thread *td)
} else {
drop_suspend = B_FALSE;
}
rw_exit(&zvol_state_lock);
ASSERT(MUTEX_HELD(&zv->zv_state_lock));
@ -1086,7 +1093,8 @@ zvol_cdev_ioctl(struct cdev *dev, ulong_t cmd, caddr_t data,
int error;
boolean_t sync;
zv = dev->si_drv2;
zv = atomic_load_ptr(&dev->si_drv2);
ASSERT3P(zv, !=, NULL);
error = 0;
KASSERT(zv->zv_open_count > 0,
@ -1147,6 +1155,7 @@ zvol_cdev_ioctl(struct cdev *dev, ulong_t cmd, caddr_t data,
*(off_t *)data = 0;
break;
case DIOCGATTR: {
rw_enter(&zv->zv_suspend_lock, ZVOL_RW_READER);
spa_t *spa = dmu_objset_spa(zv->zv_objset);
struct diocgattr_arg *arg = (struct diocgattr_arg *)data;
uint64_t refd, avail, usedobjs, availobjs;
@ -1171,6 +1180,7 @@ zvol_cdev_ioctl(struct cdev *dev, ulong_t cmd, caddr_t data,
arg->value.off = refd / DEV_BSIZE;
} else
error = SET_ERROR(ENOIOCTL);
rw_exit(&zv->zv_suspend_lock);
break;
}
case FIOSEEKHOLE:
@ -1181,10 +1191,12 @@ zvol_cdev_ioctl(struct cdev *dev, ulong_t cmd, caddr_t data,
hole = (cmd == FIOSEEKHOLE);
noff = *off;
rw_enter(&zv->zv_suspend_lock, ZVOL_RW_READER);
lr = zfs_rangelock_enter(&zv->zv_rangelock, 0, UINT64_MAX,
RL_READER);
error = dmu_offset_next(zv->zv_objset, ZVOL_OBJ, hole, &noff);
zfs_rangelock_exit(lr);
rw_exit(&zv->zv_suspend_lock);
*off = noff;
break;
}
@ -1398,7 +1410,7 @@ zvol_os_remove_minor(zvol_state_t *zv)
if (zv->zv_volmode == ZFS_VOLMODE_GEOM) {
struct zvol_state_geom *zsg = &zso->zso_geom;
struct g_provider *pp = zsg->zsg_provider;
pp->private = NULL;
atomic_store_ptr(&pp->private, NULL);
mutex_exit(&zv->zv_state_lock);
g_topology_lock();
@ -1409,7 +1421,7 @@ zvol_os_remove_minor(zvol_state_t *zv)
struct cdev *dev = zsd->zsd_cdev;
if (dev != NULL)
dev->si_drv2 = NULL;
atomic_store_ptr(&dev->si_drv2, NULL);
mutex_exit(&zv->zv_state_lock);
if (dev != NULL) {

View File

@ -679,28 +679,19 @@ zvol_open(struct block_device *bdev, fmode_t flag)
retry:
#endif
rw_enter(&zvol_state_lock, RW_READER);
/*
* Obtain a copy of private_data under the zvol_state_lock to make
* sure that either the result of zvol free code path setting
* disk->private_data to NULL is observed, or zvol_os_free()
* is not called on this zv because of the positive zv_open_count.
*/
#ifdef HAVE_BLK_MODE_T
zv = disk->private_data;
zv = atomic_load_ptr(&disk->private_data);
#else
zv = bdev->bd_disk->private_data;
zv = atomic_load_ptr(&bdev->bd_disk->private_data);
#endif
if (zv == NULL) {
rw_exit(&zvol_state_lock);
return (-SET_ERROR(ENXIO));
}
mutex_enter(&zv->zv_state_lock);
if (unlikely(zv->zv_flags & ZVOL_REMOVING)) {
mutex_exit(&zv->zv_state_lock);
rw_exit(&zvol_state_lock);
return (-SET_ERROR(ENXIO));
}
@ -712,8 +703,28 @@ retry:
if (zv->zv_open_count == 0) {
if (!rw_tryenter(&zv->zv_suspend_lock, RW_READER)) {
mutex_exit(&zv->zv_state_lock);
/*
* Removal may happen while the locks are down, so
* we can't trust zv any longer; we have to start over.
*/
#ifdef HAVE_BLK_MODE_T
zv = atomic_load_ptr(&disk->private_data);
#else
zv = atomic_load_ptr(&bdev->bd_disk->private_data);
#endif
if (zv == NULL)
return (-SET_ERROR(ENXIO));
rw_enter(&zv->zv_suspend_lock, RW_READER);
mutex_enter(&zv->zv_state_lock);
if (unlikely(zv->zv_flags & ZVOL_REMOVING)) {
mutex_exit(&zv->zv_state_lock);
rw_exit(&zv->zv_suspend_lock);
return (-SET_ERROR(ENXIO));
}
/* check to see if zv_suspend_lock is needed */
if (zv->zv_open_count != 0) {
rw_exit(&zv->zv_suspend_lock);
@ -724,7 +735,6 @@ retry:
drop_suspend = B_TRUE;
}
}
rw_exit(&zvol_state_lock);
ASSERT(MUTEX_HELD(&zv->zv_state_lock));
@ -821,11 +831,11 @@ zvol_release(struct gendisk *disk, fmode_t unused)
#if !defined(HAVE_BLOCK_DEVICE_OPERATIONS_RELEASE_1ARG)
(void) unused;
#endif
zvol_state_t *zv;
boolean_t drop_suspend = B_TRUE;
rw_enter(&zvol_state_lock, RW_READER);
zv = disk->private_data;
zvol_state_t *zv = atomic_load_ptr(&disk->private_data);
if (zv == NULL)
return;
mutex_enter(&zv->zv_state_lock);
ASSERT3U(zv->zv_open_count, >, 0);
@ -839,6 +849,15 @@ zvol_release(struct gendisk *disk, fmode_t unused)
mutex_exit(&zv->zv_state_lock);
rw_enter(&zv->zv_suspend_lock, RW_READER);
mutex_enter(&zv->zv_state_lock);
/*
* Unlike in zvol_open(), we don't check if removal
* started here, because we might be one of the openers
* that needs to be thrown out! If we're the last, we
* need to call zvol_last_close() below to finish
* cleanup. So, no special treatment for us.
*/
/* check to see if zv_suspend_lock is needed */
if (zv->zv_open_count != 1) {
rw_exit(&zv->zv_suspend_lock);
@ -848,7 +867,6 @@ zvol_release(struct gendisk *disk, fmode_t unused)
} else {
drop_suspend = B_FALSE;
}
rw_exit(&zvol_state_lock);
ASSERT(MUTEX_HELD(&zv->zv_state_lock));
@ -868,9 +886,10 @@ static int
zvol_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg)
{
zvol_state_t *zv = bdev->bd_disk->private_data;
int error = 0;
zvol_state_t *zv = atomic_load_ptr(&bdev->bd_disk->private_data);
ASSERT3P(zv, !=, NULL);
ASSERT3U(zv->zv_open_count, >, 0);
switch (cmd) {
@ -923,9 +942,8 @@ zvol_check_events(struct gendisk *disk, unsigned int clearing)
{
unsigned int mask = 0;
rw_enter(&zvol_state_lock, RW_READER);
zvol_state_t *zv = atomic_load_ptr(&disk->private_data);
zvol_state_t *zv = disk->private_data;
if (zv != NULL) {
mutex_enter(&zv->zv_state_lock);
mask = zv->zv_changed ? DISK_EVENT_MEDIA_CHANGE : 0;
@ -933,17 +951,14 @@ zvol_check_events(struct gendisk *disk, unsigned int clearing)
mutex_exit(&zv->zv_state_lock);
}
rw_exit(&zvol_state_lock);
return (mask);
}
static int
zvol_revalidate_disk(struct gendisk *disk)
{
rw_enter(&zvol_state_lock, RW_READER);
zvol_state_t *zv = atomic_load_ptr(&disk->private_data);
zvol_state_t *zv = disk->private_data;
if (zv != NULL) {
mutex_enter(&zv->zv_state_lock);
set_capacity(zv->zv_zso->zvo_disk,
@ -951,8 +966,6 @@ zvol_revalidate_disk(struct gendisk *disk)
mutex_exit(&zv->zv_state_lock);
}
rw_exit(&zvol_state_lock);
return (0);
}
@ -980,9 +993,10 @@ zvol_os_update_volsize(zvol_state_t *zv, uint64_t volsize)
static int
zvol_getgeo(struct block_device *bdev, struct hd_geometry *geo)
{
zvol_state_t *zv = bdev->bd_disk->private_data;
sector_t sectors;
zvol_state_t *zv = atomic_load_ptr(&bdev->bd_disk->private_data);
ASSERT3P(zv, !=, NULL);
ASSERT3U(zv->zv_open_count, >, 0);
sectors = get_capacity(zv->zv_zso->zvo_disk);
@ -1415,15 +1429,11 @@ zvol_os_remove_minor(zvol_state_t *zv)
ASSERT0(atomic_read(&zv->zv_suspend_ref));
ASSERT(zv->zv_flags & ZVOL_REMOVING);
/*
* Cleared while holding zvol_state_lock as a writer
* which will prevent zvol_open() from opening it.
*/
struct zvol_state_os *zso = zv->zv_zso;
zv->zv_zso = NULL;
/* Clearing private_data will make new callers return immediately. */
zso->zvo_disk->private_data = NULL;
atomic_store_ptr(&zso->zvo_disk->private_data, NULL);
/*
* Drop the state lock before calling del_gendisk(). There may be
@ -1455,17 +1465,6 @@ zvol_os_remove_minor(zvol_state_t *zv)
mutex_enter(&zv->zv_state_lock);
}
/*
* Cleanup then free a zvol_state_t which was created by zvol_alloc().
* At this time, the structure is not opened by anyone, is taken off
* the zvol_state_list, and has its private data set to NULL.
* The zvol_state_lock is dropped.
*
* This function may take many milliseconds to complete (e.g. we've seen
* it take over 256ms), due to the calls to "blk_cleanup_queue" and
* "del_gendisk". Thus, consumers need to be careful to account for this
* latency when calling this function.
*/
void
zvol_os_free(zvol_state_t *zv)
{

View File

@ -44,19 +44,30 @@
/*
* Note on locking of zvol state structures.
*
* These structures are used to maintain internal state used to emulate block
* devices on top of zvols. In particular, management of device minor number
* operations - create, remove, rename, and set_snapdev - involves access to
* these structures. The zvol_state_lock is primarily used to protect the
* zvol_state_list. The zv->zv_state_lock is used to protect the contents
* of the zvol_state_t structures, as well as to make sure that when the
* time comes to remove the structure from the list, it is not in use, and
* therefore, it can be taken off zvol_state_list and freed.
* zvol_state_t represents the connection between a single dataset
* (DMU_OST_ZVOL) and the device "minor" (some OS-specific representation of a
* "disk" or "device" or "volume", eg, a /dev/zdXX node, a GEOM object, etc).
*
* The zv_suspend_lock was introduced to allow for suspending I/O to a zvol,
* e.g. for the duration of receive and rollback operations. This lock can be
* held for significant periods of time. Given that it is undesirable to hold
* mutexes for long periods of time, the following lock ordering applies:
* The global zvol_state_lock is used to protect access to zvol_state_list and
* zvol_htable, which are the primary way to obtain a zvol_state_t from a name.
* It should not be used for anything that is not name-related, and you should
* avoid sleeping or waiting while it's held. See zvol_find_by_name(),
* zvol_insert(), zvol_remove().
*
* The zv_state_lock is used to protect the contents of the associated
* zvol_state_t. Most of the zvol_state_t is dedicated to control and
* configuration; almost none of it is needed for data operations (that is,
* read, write, flush) so this lock is rarely taken during general IO. It
* should be released quickly; you should avoid sleeping or waiting while it's
* held.
*
* zv_suspend_lock is used to suspend IO/data operations to a zvol. The read
* half should be held for the duration of an IO operation. The write half
* should be taken when something needs to wait for IO to complete and then
* block further IO, eg for the duration of receive and rollback operations.
* This lock can be held for long periods of time.
*
* Thus, the following lock ordering applies:
* - take zvol_state_lock if necessary, to protect zvol_state_list
* - take zv_suspend_lock if necessary, by the code path in question
* - take zv_state_lock to protect zvol_state_t
@ -67,9 +78,8 @@
* these operations are serialized per pool. Consequently, we can be certain
* that for a given zvol, there is only one operation at a time in progress.
* That is why one can be sure that first, zvol_state_t for a given zvol is
* allocated and placed on zvol_state_list, and then other minor operations
* for this zvol are going to proceed in the order of issue.
*
* allocated and placed on zvol_state_list, and then other minor operations for
* this zvol are going to proceed in the order of issue.
*/
#include <sys/dataset_kstats.h>