zvol: stop using zvol_state_lock to protect OS-side private data

zvol_state_lock is intended to protect access to the global name->zvol
lists (zvol_find_by_name()), but has also been used to control access to
OS-side private data, accessed through whatever kernel object is used to
represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side
object is what's used to get a handle on zvol_state_t, so zv_state_lock
and zv_suspend_lock can't be used to manage access. Also, with the
private object and the zvol_state_t being shut down and destroyed at the
same time in zvol_os_free(), we must ensure that the private object
pointer only ever corresponds to a real zvol_state_t, not one in partial
destruction. Taking the global lock seems like a convenient way to
ensure this.

The problem with this is that zvol_state_lock does not actually protect
access to the zvol_state_t internals, so we also need to take
zv_state_lock and/or zv_suspend_lock. If those are contended, this can
then cause OS-side operations (eg zvol_open()) to sleep waiting for them
while holding zvol_state_lock. This then blocks out all other OS-side
operations which want to get the private data, and any ZFS-side control
operations that would take the write half of the lock. It's even worse
if ZFS-side operations induce OS-side calls back into the zvol (eg
creating a zvol triggers a partition probe inside the kernel, and also a
userspace access from udev to set up device links). And it gets even
worse again if anything decides to defer those ops to a task and wait on
them, which zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private
data pointer will always be NULL'd out in zvol_os_remove_minor()
_before_ the zvol_state_t is made invalid, but it won't happen until all
users are ejected. So, if we make access to the private object pointer
atomic, we remove the need to take a global lockout to access it, and so
we can remove all acquisitions of zvol_state_lock from the OS side.
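
As a rough illustration of the idea only (not the code from this change),
here is a minimal userspace sketch using C11 atomics. The os_disk_t type
and the disk_attach(), disk_get_zvol() and disk_detach() helpers are
hypothetical stand-ins invented for this example, and zvol_state_t is
reduced to a stub. The sketch assumes the guarantee described above holds:
the pointer is cleared before the zvol_state_t becomes invalid, and only
once all users are gone. The atomic load by itself does not provide any
lifetime protection.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stub standing in for the real zvol_state_t (assumption for this sketch). */
typedef struct zvol_state {
	int zv_open_count;
} zvol_state_t;

/* Hypothetical OS-side object (think gendisk or GEOM provider). */
typedef struct os_disk {
	_Atomic(zvol_state_t *) private_data;
} os_disk_t;

/* Publish the zvol_state_t on the OS-side object once it is fully set up. */
static void
disk_attach(os_disk_t *disk, zvol_state_t *zv)
{
	assert(zv != NULL);
	atomic_store(&disk->private_data, zv);
}

/*
 * OS-side entry points (open, ioctl, ...) fetch the pointer with a single
 * atomic load instead of taking a global lock. NULL means the volume is
 * gone or going away, and the caller should fail the operation.
 */
static zvol_state_t *
disk_get_zvol(os_disk_t *disk)
{
	return (atomic_load(&disk->private_data));
}

/*
 * Teardown clears the pointer before the zvol_state_t is invalidated and
 * hands the old value back so the caller can finish destroying it.
 */
static zvol_state_t *
disk_detach(os_disk_t *disk)
{
	return (atomic_exchange(&disk->private_data, (zvol_state_t *)NULL));
}
```

With this shape, a caller that gets NULL back from disk_get_zvol() simply
fails the operation, and only then goes on to take zv_suspend_lock and/or
zv_state_lock on a non-NULL result, without ever holding a global lock.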

While here, I've rewritten much of the locking theory comment at the top
of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've
tried to describe the purpose of each lock in a little more detail, and
in particular describe where it should and shouldn't be used.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
Rob Norris
2025-08-05 14:19:24 +10:00
committed by Brian Behlendorf
parent 96f9d271ea
commit 8a0e5e8b54
3 changed files with 134 additions and 113 deletions
+25 -15
@@ -44,19 +44,30 @@
/*
* Note on locking of zvol state structures.
*
* These structures are used to maintain internal state used to emulate block
* devices on top of zvols. In particular, management of device minor number
* operations - create, remove, rename, and set_snapdev - involves access to
* these structures. The zvol_state_lock is primarily used to protect the
* zvol_state_list. The zv->zv_state_lock is used to protect the contents
* of the zvol_state_t structures, as well as to make sure that when the
* time comes to remove the structure from the list, it is not in use, and
* therefore, it can be taken off zvol_state_list and freed.
* zvol_state_t represents the connection between a single dataset
* (DMU_OST_ZVOL) and the device "minor" (some OS-specific representation of a
* "disk" or "device" or "volume", eg, a /dev/zdXX node, a GEOM object, etc).
*
* The zv_suspend_lock was introduced to allow for suspending I/O to a zvol,
* e.g. for the duration of receive and rollback operations. This lock can be
* held for significant periods of time. Given that it is undesirable to hold
* mutexes for long periods of time, the following lock ordering applies:
* The global zvol_state_lock is used to protect access to zvol_state_list and
* zvol_htable, which are the primary way to obtain a zvol_state_t from a name.
* It should not be used for anything not name-related, and you should avoid
* sleeping or waiting while it's held. See zvol_find_by_name(), zvol_insert(),
* zvol_remove().
*
* The zv_state_lock is used to protect the contents of the associated
* zvol_state_t. Most of the zvol_state_t is dedicated to control and
* configuration; almost none of it is needed for data operations (that is,
* read, write, flush) so this lock is rarely taken during general IO. It
* should be released quickly; you should avoid sleeping or waiting while it's
* held.
*
* zv_suspend_lock is used to suspend IO/data operations to a zvol. The read
* half should be held for the duration of an IO operation. The write half
* should be taken when something needs to wait for IO to complete and then
* block further IO, eg for the duration of receive and rollback operations.
* This lock can be held for long periods of time.
*
* Thus, the following lock ordering applies:
* - take zvol_state_lock if necessary, to protect zvol_state_list
* - take zv_suspend_lock if necessary, by the code path in question
* - take zv_state_lock to protect zvol_state_t
@@ -67,9 +78,8 @@
* these operations are serialized per pool. Consequently, we can be certain
* that for a given zvol, there is only one operation at a time in progress.
* That is why one can be sure that first, zvol_state_t for a given zvol is
* allocated and placed on zvol_state_list, and then other minor operations
* for this zvol are going to proceed in the order of issue.
*
* allocated and placed on zvol_state_list, and then other minor operations for
* this zvol are going to proceed in the order of issue.
*/
#include <sys/dataset_kstats.h>