29cb6fcbb7
Users have been reporting [1] that VMs occasionally become unresponsive with high CPU usage for some time (varying between ~1 and more than 60 seconds). After that time, the guests come back and continue running. Windows VMs seem most affected (not responding to pings during the hang, RDP sessions time out), but we also got reports about Linux VMs (reporting soft lockups).

The issue was not present on host kernel 5.15 and was first reported with kernel 6.2. Users reported that the issue becomes easier to trigger the more memory is assigned to the guests. Setting mitigations=off was reported to alleviate (but not eliminate) the issue. For most users the issue seems to disappear after (also) disabling KSM [2], but some users reported freezes even with KSM disabled [3].

It turned out the reports concerned NUMA hosts only, and that the freezes correlated with runs of the NUMA balancer [4]. Users reported that disabling the NUMA balancer resolves the issue (even with KSM enabled).

We put together a Linux VM reproducer, ran a git-bisect on the kernel to find the commit introducing the issue and asked upstream for help [5]. As it turned out, an upstream bugreport was recently opened [6] and a preliminary fix to the KVM TDP MMU was proposed [7]. With that patch [7] on top of kernel 6.7, the reproducer does not trigger freezes anymore. As of now, the patch (or its v2 [8]) is not yet merged in the mainline kernel, and backporting it may be difficult due to dependencies on other KVM changes [9].

However, the bugreport [6] also prompted an upstream developer to propose a patch to the kernel scheduler logic that decides whether a contended spinlock/rwlock should be dropped [10]. Without the patch, PREEMPT_DYNAMIC kernels (such as ours) would always drop contended locks. With the patch, the kernel only drops contended locks if the kernel is currently set to preempt=full. As noted in the commit message [10], this can (counter-intuitively) improve KVM performance.

Our kernel defaults to preempt=voluntary (according to /sys/kernel/debug/sched/preempt), so with the patch it does not drop contended locks anymore, and the reproducer does not trigger freezes anymore. Hence, backport [10] to our kernel.

[1] https://forum.proxmox.com/threads/130727/
[2] https://forum.proxmox.com/threads/130727/page-4#post-575886
[3] https://forum.proxmox.com/threads/130727/page-8#post-617587
[4] https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing
[5] https://lore.kernel.org/kvm/832697b9-3652-422d-a019-8c0574a188ac@proxmox.com/
[6] https://bugzilla.kernel.org/show_bug.cgi?id=218259
[7] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com/
[8] https://lore.kernel.org/all/20240110012045.505046-1-seanjc@google.com/
[9] https://lore.kernel.org/kvm/Zaa654hwFKba_7pf@google.com/
[10] https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com/

Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
||
---|---|---|
0001-Make-mkcompile_h-accept-an-alternate-timestamp-strin.patch | ||
0002-wireless-Add-Debian-wireless-regdb-certificates.patch | ||
0003-bridge-keep-MAC-of-first-assigned-port.patch | ||
0004-pci-Enable-overrides-for-missing-ACS-capabilities-4..patch | ||
0005-kvm-disable-default-dynamic-halt-polling-growth.patch | ||
0006-net-core-downgrade-unregister_netdevice-refcount-lea.patch | ||
0007-Revert-fortify-Do-not-cast-to-unsigned-char.patch | ||
0008-kvm-xsave-set-mask-out-PKRU-bit-in-xfeatures-if-vCPU.patch | ||
0009-allow-opt-in-to-allow-pass-through-on-broken-hardwar.patch | ||
0010-Revert-nSVM-Check-for-reserved-encodings-of-TLB_CONT.patch | ||
0011-KVM-nSVM-Advertise-support-for-flush-by-ASID.patch | ||
0012-revert-memfd-improve-userspace-warnings-for-missing-.patch | ||
0013-drm-amd-Fix-UBSAN-array-index-out-of-bounds-for-Powe.patch | ||
0014-Revert-scsi-aacraid-Reply-queue-mapping-to-CPUs-base.patch | ||
0015-ext4-fallback-to-complex-scan-if-aligned-scan-doesn-.patch | ||
0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch |