mirror of
https://git.proxmox.com/git/mirror_zfs.git
synced 2026-05-23 02:44:41 +03:00
zvol: stop using zvol_state_lock to protect OS-side private data
zvol_state_lock is intended to protect access to the global name->zvol lists (zvol_find_by_name()), but has also been used to control access to OS-side private data, accessed through whatever kernel object is used to represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side object is what's used to get a handle on zvol_state_t, so zv_state_lock and zv_suspend_lock can't be used to manage access, but also, with the private object and the zvol_state_t being shut down and destroyed at the same time in zvol_os_free(), we must ensure that the private object pointer only ever corresponds to a real zvol_state_t, not one in partial destruction. Taking the global lock seems like a convenient way to ensure this.

The problem with this is that zvol_state_lock does not actually protect access to the zvol_state_t internals, so we need to take zv_state_lock and/or zv_suspend_lock. If those are contended, this can then cause OS-side operations (eg zvol_open()) to sleep to wait for them while holding zvol_state_lock. This then blocks out all other OS-side operations which want to get the private data, and any ZFS-side control operations that would take the write half of the lock. It's even worse if ZFS-side operations induce OS-side calls back into the zvol (eg creating a zvol triggers a partition probe inside the kernel, and also a userspace access from udev to set up device links). And it gets even worse again if anything decides to defer those ops to a task and wait on them, which zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private data pointer will always be NULL'd out in zvol_os_remove_minor() _before_ the zvol_state_t is made invalid, but it won't happen until all users are ejected.

So, if we make access to the private object pointer atomic, we remove the need to take a global lockout to access it, and so we can remove all acquisitions of zvol_state_lock from the OS side.

While here, I've rewritten much of the locking theory comment at the top of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've tried to describe the purpose of each lock in a little more detail, and in particular describe where it should and shouldn't be used.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
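As a rough illustration of the idea in the message above (not the code from this commit), the OS-side pointer can be published and cleared with atomic operations so that an entry point never needs the global lock just to find its zvol_state_t. This is a minimal sketch using C11 atomics; `os_disk_t`, `private_data`, `zv_open_count` and the `sketch_*` functions are hypothetical names, and the sketch does not model ejecting existing users before the pointer is cleared.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in; not the real OpenZFS zvol_state_t. */
typedef struct zvol_state {
    int zv_open_count;
} zvol_state_t;

/* Hypothetical stand-in for the OS-side object (gendisk, GEOM, etc). */
typedef struct os_disk {
    _Atomic(zvol_state_t *) private_data;
} os_disk_t;

/* Publish the zvol_state_t through the OS-side object. */
static void
sketch_attach(os_disk_t *disk, zvol_state_t *zv)
{
    atomic_store(&disk->private_data, zv);
}

/* OS-side entry point: fetch the pointer without any global lock. */
static int
sketch_open(os_disk_t *disk)
{
    zvol_state_t *zv = atomic_load(&disk->private_data);

    if (zv == NULL)
        return (ENXIO);     /* minor is gone or being removed */
    zv->zv_open_count++;    /* real code would hold a per-zvol lock here */
    return (0);
}

/*
 * Removal: clear the pointer first, so any later caller sees NULL
 * before the zvol_state_t itself is torn down by the caller.
 */
static zvol_state_t *
sketch_remove_minor(os_disk_t *disk)
{
    return (atomic_exchange(&disk->private_data, (zvol_state_t *)NULL));
}
```

In this sketch a caller that loses the race with removal simply gets ENXIO; nothing on the open path waits on a lock shared with every other volume.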
Committed by: Brian Behlendorf
Parent: 96f9d271ea
Commit: 8a0e5e8b54
Changes: +25 -15
@@ -44,19 +44,30 @@
 /*
  * Note on locking of zvol state structures.
  *
- * These structures are used to maintain internal state used to emulate block
- * devices on top of zvols. In particular, management of device minor number
- * operations - create, remove, rename, and set_snapdev - involves access to
- * these structures. The zvol_state_lock is primarily used to protect the
- * zvol_state_list. The zv->zv_state_lock is used to protect the contents
- * of the zvol_state_t structures, as well as to make sure that when the
- * time comes to remove the structure from the list, it is not in use, and
- * therefore, it can be taken off zvol_state_list and freed.
+ * zvol_state_t represents the connection between a single dataset
+ * (DMU_OST_ZVOL) and the device "minor" (some OS-specific representation of a
+ * "disk" or "device" or "volume", eg, a /dev/zdXX node, a GEOM object, etc).
  *
- * The zv_suspend_lock was introduced to allow for suspending I/O to a zvol,
- * e.g. for the duration of receive and rollback operations. This lock can be
- * held for significant periods of time. Given that it is undesirable to hold
- * mutexes for long periods of time, the following lock ordering applies:
+ * The global zvol_state_lock is used to protect access to zvol_state_list and
+ * zvol_htable, which are the primary way to obtain a zvol_state_t from a name.
+ * It should not be used for anything not name-related, and you should avoid
+ * sleeping or waiting while it's held. See zvol_find_by_name(), zvol_insert(),
+ * zvol_remove().
+ *
+ * The zv_state_lock is used to protect the contents of the associated
+ * zvol_state_t. Most of the zvol_state_t is dedicated to control and
+ * configuration; almost none of it is needed for data operations (that is,
+ * read, write, flush) so this lock is rarely taken during general IO. It
+ * should be released quickly; you should avoid sleeping or waiting while it's
+ * held.
+ *
+ * zv_suspend_lock is used to suspend IO/data operations to a zvol. The read
+ * half should be held for the duration of an IO operation. The write half
+ * should be taken when something needs to wait for IO to complete and then
+ * block further IO, eg for the duration of receive and rollback operations.
+ * This lock can be held for long periods of time.
+ *
+ * Thus, the following lock ordering applies:
  * - take zvol_state_lock if necessary, to protect zvol_state_list
  * - take zv_suspend_lock if necessary, by the code path in question
  * - take zv_state_lock to protect zvol_state_t
@@ -67,9 +78,8 @@
  * these operations are serialized per pool. Consequently, we can be certain
  * that for a given zvol, there is only one operation at a time in progress.
  * That is why one can be sure that first, zvol_state_t for a given zvol is
- * allocated and placed on zvol_state_list, and then other minor operations
- * for this zvol are going to proceed in the order of issue.
- *
+ * allocated and placed on zvol_state_list, and then other minor operations for
+ * this zvol are going to proceed in the order of issue.
  */
 
 #include <sys/dataset_kstats.h>