cherry-pick scheduler fix to avoid temporary VM freezes on NUMA hosts
Users have been reporting [1] that VMs occasionally become unresponsive with high CPU usage for some time (varying between ~1 and more than 60 seconds). After that time, the guests come back and continue running. Windows VMs seem most affected (not responding to pings during the hang, RDP sessions time out), but we also got reports about Linux VMs (reporting soft lockups). The issue was not present on host kernel 5.15 and was first reported with kernel 6.2. Users reported that the issue becomes easier to trigger the more memory is assigned to the guests. Setting mitigations=off was reported to alleviate (but not eliminate) the issue. For most users the issue seems to disappear after (also) disabling KSM [2], but some users reported freezes even with KSM disabled [3]. It turned out the reports concerned NUMA hosts only, and that the freezes correlated with runs of the NUMA balancer [4]. Users reported that disabling the NUMA balancer resolves the issue (even with KSM enabled). We put together a Linux VM reproducer, ran a git-bisect on the kernel to find the commit introducing the issue and asked upstream for help [5]. As it turned out, an upstream bugreport was recently opened [6] and a preliminary fix to the KVM TDP MMU was proposed [7]. With that patch [7] on top of kernel 6.7, the reproducer does not trigger freezes anymore. As of now, the patch (or its v2 [8]) is not yet merged in the mainline kernel, and backporting it may be difficult due to dependencies on other KVM changes [9]. However, the bugreport [6] also prompted an upstream developer to propose a patch to the kernel scheduler logic that decides whether a contended spinlock/rwlock should be dropped [10]. Without the patch, PREEMPT_DYNAMIC kernels (such as ours) would always drop contended locks. With the patch, the kernel only drops contended locks if the kernel is currently set to preempt=full. As noted in the commit message [10], this can (counter-intuitively) improve KVM performance. Our kernel defaults to preempt=voluntary (according to /sys/kernel/debug/sched/preempt), so with the patch it does not drop contended locks anymore, and the reproducer does not trigger freezes anymore. Hence, backport [10] to our kernel. [1] https://forum.proxmox.com/threads/130727/ [2] https://forum.proxmox.com/threads/130727/page-4#post-575886 [3] https://forum.proxmox.com/threads/130727/page-8#post-617587 [4] https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing [5] https://lore.kernel.org/kvm/832697b9-3652-422d-a019-8c0574a188ac@proxmox.com/ [6] https://bugzilla.kernel.org/show_bug.cgi?id=218259 [7] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com/ [8] https://lore.kernel.org/all/20240110012045.505046-1-seanjc@google.com/ [9] https://lore.kernel.org/kvm/Zaa654hwFKba_7pf@google.com/ [10] https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com/ Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
This commit is contained in:
parent
5dde66b4fe
commit
29cb6fcbb7
@ -0,0 +1,78 @@
|
||||
From 39f2bfe0177d3f56c9feac4e70424e4952949e2a Mon Sep 17 00:00:00 2001
|
||||
From: Sean Christopherson <seanjc@google.com>
|
||||
Date: Wed, 10 Jan 2024 13:47:23 -0800
|
||||
Subject: [PATCH] sched/core: Drop spinlocks on contention iff kernel is
|
||||
preemptible
|
||||
|
||||
Use preempt_model_preemptible() to detect a preemptible kernel when
|
||||
deciding whether or not to reschedule in order to drop a contended
|
||||
spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
|
||||
built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
|
||||
preemption model is "none" or "voluntary". In short, make kernels with
|
||||
dynamically selected models behave the same as kernels with statically
|
||||
selected models.
|
||||
|
||||
Somewhat counter-intuitively, NOT yielding a lock can provide better
|
||||
latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a
|
||||
rwlock, is often contended between an invalidation event (takes mmu_lock
|
||||
for write) and a vCPU servicing a guest page fault (takes mmu_lock for
|
||||
read). For _some_ setups, letting the invalidation task complete even
|
||||
if there is mmu_lock contention provides lower latency for *all* tasks,
|
||||
i.e. the invalidation completes sooner *and* the vCPU services the guest
|
||||
page fault sooner.
|
||||
|
||||
But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
|
||||
can vary depending on the host VMM, the guest workload, the number of
|
||||
vCPUs, the number of pCPUs in the host, why there is lock contention, etc.
|
||||
|
||||
In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
|
||||
opposite and removing contention yielding entirely) needs to come with a
|
||||
big pile of data proving that changing the status quo is a net positive.
|
||||
|
||||
Cc: Valentin Schneider <valentin.schneider@arm.com>
|
||||
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
|
||||
Cc: Marco Elver <elver@google.com>
|
||||
Cc: Frederic Weisbecker <frederic@kernel.org>
|
||||
Cc: David Matlack <dmatlack@google.com>
|
||||
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
||||
---
|
||||
include/linux/sched.h | 14 ++++++--------
|
||||
1 file changed, 6 insertions(+), 8 deletions(-)
|
||||
|
||||
diff --git a/include/linux/sched.h b/include/linux/sched.h
|
||||
index 292c31697248..a274bc85f222 100644
|
||||
--- a/include/linux/sched.h
|
||||
+++ b/include/linux/sched.h
|
||||
@@ -2234,11 +2234,10 @@ static inline bool preempt_model_preemptible(void)
|
||||
*/
|
||||
static inline int spin_needbreak(spinlock_t *lock)
|
||||
{
|
||||
-#ifdef CONFIG_PREEMPTION
|
||||
+ if (!preempt_model_preemptible())
|
||||
+ return 0;
|
||||
+
|
||||
return spin_is_contended(lock);
|
||||
-#else
|
||||
- return 0;
|
||||
-#endif
|
||||
}
|
||||
|
||||
/*
|
||||
@@ -2251,11 +2250,10 @@ static inline int spin_needbreak(spinlock_t *lock)
|
||||
*/
|
||||
static inline int rwlock_needbreak(rwlock_t *lock)
|
||||
{
|
||||
-#ifdef CONFIG_PREEMPTION
|
||||
+ if (!preempt_model_preemptible())
|
||||
+ return 0;
|
||||
+
|
||||
return rwlock_is_contended(lock);
|
||||
-#else
|
||||
- return 0;
|
||||
-#endif
|
||||
}
|
||||
|
||||
static __always_inline bool need_resched(void)
|
||||
--
|
||||
2.39.2
|
||||
|
Loading…
Reference in New Issue
Block a user