Fix snapshot automount expiry cancellation deadlock

A deadlock occurs when a snapshot expiry task is cancelled by a thread
that holds locks. The snapshot expiry task (snapentry_expire) spawns an
umount process and waits for it to complete. Concurrently, ARC memory
pressure triggers arc_prune, which calls zfs_exit_fs() and attempts to
cancel the expiry task while holding locks. The umount process blocks
trying to acquire the locks held by arc_prune, which is itself blocked
waiting for the expiry task to complete. This creates a circular
dependency: the expiry task waits for umount, umount waits for
arc_prune, and arc_prune waits for the expiry task.

Fix by adding non-blocking cancellation support to taskq_cancel_id().
The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to
reschedule the unmount, which needs to cancel any existing expiry task.
It now uses non-blocking cancellation to avoid waiting while holding
locks, breaking the deadlock by returning immediately when the task is
already running.
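
The non-blocking behaviour described above can be sketched with a toy,
single-threaded model. This is not the SPL implementation: the names
toy_task_t and toy_cancel() are hypothetical, and the return codes
(0, EBUSY, ENOENT) are assumptions made for this illustration only.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical task states for a toy, single-threaded model. */
typedef enum { TOY_PENDING, TOY_RUNNING, TOY_DONE } toy_state_t;

typedef struct toy_task {
	toy_state_t state;
} toy_task_t;

/*
 * Toy stand-in for a cancel call that takes a wait flag.
 * Return values are assumptions of this sketch:
 *   0      - task was still pending and has been removed
 *   EBUSY  - task is running and wait is false; nothing was waited for
 *   ENOENT - task had already finished, or finished while we waited
 */
static int
toy_cancel(toy_task_t *t, bool wait)
{
	switch (t->state) {
	case TOY_PENDING:
		t->state = TOY_DONE;	/* dequeued before it ever ran */
		return (0);
	case TOY_RUNNING:
		if (!wait)
			return (EBUSY);	/* return immediately, state untouched */
		/* A real blocking cancel would sleep here; model it as done. */
		t->state = TOY_DONE;
		return (ENOENT);
	default:
		return (ENOENT);
	}
}
```

A caller that holds locks would pass wait as false and treat EBUSY as
"the task is in flight, do not wait for it", which is what removes the
arc_prune edge from the cycle.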

The per-entry se_taskqid_lock has been removed, with all taskqid
operations now protected by the global zfs_snapshot_lock held as
WRITER. Additionally, an se_in_umount flag prevents recursive waits when
zfsctl_destroy() is called during unmount. The taskqid is now only
cleared by the caller on successful cancellation; running tasks clear
their own taskqid upon completion.
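
The taskqid ownership rule can be illustrated with a minimal sketch.
The type toy_entry_t and both helpers are hypothetical stand-ins, not
the actual snapshot entry code, and the locking (zfs_snapshot_lock held
as WRITER) is deliberately left out of the model.

```c
#include <errno.h>
#include <stdbool.h>

typedef unsigned long toy_taskqid_t;
#define	TOY_TASKQID_INVALID	0UL

/* Hypothetical snapshot entry; only the fields needed for the rule. */
typedef struct toy_entry {
	toy_taskqid_t taskqid;	/* id of the scheduled expiry task, if any */
	bool task_running;	/* true while the expiry task is executing */
} toy_entry_t;

/* Caller side: clear the id only when a pending task was cancelled. */
static int
toy_entry_cancel(toy_entry_t *e)
{
	if (e->taskqid == TOY_TASKQID_INVALID)
		return (ENOENT);	/* nothing scheduled */
	if (e->task_running)
		return (EBUSY);		/* leave the id for the task to clear */
	e->taskqid = TOY_TASKQID_INVALID;
	return (0);
}

/* Task side: a running task clears its own id when it completes. */
static void
toy_entry_task_done(toy_entry_t *e)
{
	e->task_running = false;
	e->taskqid = TOY_TASKQID_INVALID;
}
```

Keeping a single writer for each case means the id is never cleared
while the task it names is still executing.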

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17941
Ameer Hamza
2025-12-02 03:43:42 +05:00
committed by GitHub
parent 4754ac8529
commit 88d012a1d6
14 changed files with 71 additions and 48 deletions
+1 -1
@@ -114,7 +114,7 @@ extern void taskq_wait_id(taskq_t *, taskqid_t);
extern void taskq_wait_outstanding(taskq_t *, taskqid_t);
extern int taskq_member(taskq_t *, kthread_t *);
extern taskq_t *taskq_of_curthread(void);
-extern int taskq_cancel_id(taskq_t *, taskqid_t);
+extern int taskq_cancel_id(taskq_t *, taskqid_t, boolean_t);
extern void system_taskq_init(void);
extern void system_taskq_fini(void);
+2 -2
@@ -407,9 +407,9 @@ taskq_of_curthread(void)
}
int
-taskq_cancel_id(taskq_t *tq, taskqid_t id)
+taskq_cancel_id(taskq_t *tq, taskqid_t id, boolean_t wait)
{
-(void) tq, (void) id;
+(void) tq, (void) id, (void) wait;
return (ENOENT);
}
+1
@@ -2137,6 +2137,7 @@
<function-decl name='taskq_cancel_id' mangled-name='taskq_cancel_id' visibility='default' binding='global' size-in-bits='64' elf-symbol-id='taskq_cancel_id'>
<parameter type-id='4f8ed29a' name='tq'/>
<parameter type-id='de0ea20e' name='id'/>
+<parameter type-id='c19b74c3' name='wait'/>
<return type-id='95e97e5e'/>
</function-decl>
<function-decl name='system_taskq_init' mangled-name='system_taskq_init' visibility='default' binding='global' size-in-bits='64' elf-symbol-id='system_taskq_init'>
+1
@@ -2127,6 +2127,7 @@
<function-decl name='taskq_cancel_id' mangled-name='taskq_cancel_id' visibility='default' binding='global' size-in-bits='64' elf-symbol-id='taskq_cancel_id'>
<parameter type-id='4f8ed29a' name='tq'/>
<parameter type-id='de0ea20e' name='id'/>
+<parameter type-id='c19b74c3' name='wait'/>
<return type-id='95e97e5e'/>
</function-decl>
<function-decl name='system_taskq_init' mangled-name='system_taskq_init' visibility='default' binding='global' size-in-bits='64' elf-symbol-id='system_taskq_init'>