It's in the master-next branch of the current Ubuntu noble kernel git
tree, and a NULL check cannot really hurt.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
The reporter has an Adaptec 5805 controller (using the aacraid
driver), which reports a byteswapped page length for VPD page 0. It
reports "02 00" as page length instead of "00 02".
This stopped working with kernel 6.8.4 due to commit b5fc07a5fb56
("scsi: core: Consult supported VPD page list prior to fetching page").
To address that issue, limit the page search scope to the size of our
VPD buffer, to guard against devices returning a larger page count
than requested.
Reported-by: Peter Schneider <pschneider1968@googlemail.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
The patch from commit e5731f4 ("backport fix for managing block flush
queue list") caused some fallout when used with LVM on root, as that
uses a rather odd (but previously fine-working) PREFLUSH | POSTFLUSH
format that now caused the list to be used without being initialized,
resulting in freezes.
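Schematically, the failure mode is the classic uninitialized-list one
(illustrative only, not the actual block layer code):

    /* A request's list head must be initialized before the flush
     * machinery links or unlinks it; otherwise prev/next contain
     * garbage and the first list operation corrupts memory, observed
     * here as a freeze. */
    INIT_LIST_HEAD(&rq->queuelist);          /* the step that was skipped */
    list_add_tail(&rq->queuelist, pending);  /* only now safe to link */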
Link: https://lore.kernel.org/all/20240608143115.972486-1-chengming.zhou@linux.dev/
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reported in the community forum [0] and easy to reproduce by doing
e.g.
> while true; do mount -t nfs 192.168.20.148:/rpool/data /mnt/test; done
from another node for a share that does not exist or for which the
client has no permissions.
[0]: https://forum.proxmox.com/threads/146649
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
The original fix disabled the XSAVES feature for Zen 1/2. The issue
has since been fixed in the CPUs' microcode, and this patch keeps the
feature enabled if the microcode version is recent enough to contain
the fix.
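The gating is roughly of the following shape (a sketch; the revision
cut-off constant is a placeholder, not the value from the actual
patch):

    /* Zen 1/2 are family 0x17; only disable XSAVES there if the loaded
     * microcode is too old to contain the erratum fix. */
    if (c->x86 == 0x17 && c->microcode < ZEN_XSAVES_FIXED_REV) /* placeholder */
            clear_cpu_cap(c, X86_FEATURE_XSAVES);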
Signed-off-by: Folke Gleumes <f.gleumes@proxmox.com>
With apparmor 4, when recvmsg() calls are checked by the apparmor LSM
they will always return EINVAL.
This causes very weird issues when apparmor profiles are in use, and a
lot of networking issues in containers (which are always using
apparmor).
When coming from sys_recvmsg, msg->msg_namelen is explicitly set to
zero early on. (see ____sys_recvmsg in net/socket.c)
We still end up in 'map_addr' where the assumption is that addr !=
NULL means addrlen has a valid size.
This is likely not the final fix; it was suggested by jjohansen on IRC
to get things going until this is resolved properly.
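The interim guard is roughly of this shape (a sketch; the real check
lives in AppArmor's socket address mediation code):

    /* On recvmsg(), ____sys_recvmsg() has already zeroed msg_namelen,
     * so a non-NULL addr with addrlen == 0 means "no address given"
     * and must not be rejected with -EINVAL. */
    if (addr && addrlen == 0)
            addr = NULL;    /* treat as unnamed instead of erroring out */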
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
This reverts commit 29cb6fcbb7; user feedback showed no positive
impact of this patch, and upstream still has no fix for older stable
releases (only for 6.8), so for now rather revert this and wait for
either a better (well, actual) fix or the update to 6.8 or newer.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Users have been reporting [1] that VMs occasionally become
unresponsive with high CPU usage for some time (varying between ~1 and
more than 60 seconds). After that time, the guests come back and
continue running. Windows VMs seem most affected (not responding to
pings during the hang, RDP sessions time out), but we also got reports
about Linux VMs (reporting soft lockups). The issue was not present on
host kernel 5.15 and was first reported with kernel 6.2. Users
reported that the issue becomes easier to trigger the more memory is
assigned to the guests. Setting mitigations=off was reported to
alleviate (but not eliminate) the issue. For most users the issue
seems to disappear after (also) disabling KSM [2], but some users
reported freezes even with KSM disabled [3].
It turned out the reports concerned NUMA hosts only, and that the
freezes correlated with runs of the NUMA balancer [4]. Users reported
that disabling the NUMA balancer resolves the issue (even with KSM
enabled).
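For reference, the NUMA balancer can be toggled at runtime via its
sysctl [4], e.g.
> sysctl -w kernel.numa_balancing=0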
We put together a Linux VM reproducer, ran a git-bisect on the kernel
to find the commit introducing the issue and asked upstream for help
[5]. As it turned out, an upstream bug report had recently been opened [6]
and a preliminary fix to the KVM TDP MMU was proposed [7]. With that
patch [7] on top of kernel 6.7, the reproducer does not trigger
freezes anymore. As of now, the patch (or its v2 [8]) is not yet
merged in the mainline kernel, and backporting it may be difficult due
to dependencies on other KVM changes [9].
However, the bug report [6] also prompted an upstream developer to
propose a patch to the kernel scheduler logic that decides whether a
contended spinlock/rwlock should be dropped [10]. Without the patch,
PREEMPT_DYNAMIC kernels (such as ours) would always drop contended
locks. With the patch, the kernel only drops contended locks if the
kernel is currently set to preempt=full. As noted in the commit
message [10], this can (counter-intuitively) improve KVM performance.
Our kernel defaults to preempt=voluntary (according to
/sys/kernel/debug/sched/preempt), so with the patch it does not drop
contended locks anymore, and the reproducer does not trigger freezes
anymore. Hence, backport [10] to our kernel.
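The gist of [10] in simplified form (not a verbatim copy of the
proposed patch):

    /* Only ask the holder to drop a contended spinlock if the kernel
     * currently runs with preempt=full; under preempt=none/voluntary,
     * keep the lock and let the holder finish its critical section. */
    static inline int spin_needbreak(spinlock_t *lock)
    {
            if (!preempt_model_preemptible())
                    return 0;

            return spin_is_contended(lock);
    }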
[1] https://forum.proxmox.com/threads/130727/
[2] https://forum.proxmox.com/threads/130727/page-4#post-575886
[3] https://forum.proxmox.com/threads/130727/page-8#post-617587
[4] https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing
[5] https://lore.kernel.org/kvm/832697b9-3652-422d-a019-8c0574a188ac@proxmox.com/
[6] https://bugzilla.kernel.org/show_bug.cgi?id=218259
[7] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com/
[8] https://lore.kernel.org/all/20240110012045.505046-1-seanjc@google.com/
[9] https://lore.kernel.org/kvm/Zaa654hwFKba_7pf@google.com/
[10] https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com/
Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
to silence array-index-out-of-bounds warnings for dynamically-sized
arrays. All commits applied cleanly and just replace array[1] with
array[].
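I.e., the pattern changed by those commits looks like this (schematic
example):

    struct item { int id; };        /* just for illustration */

    /* before: fake one-element array used as a variable-length tail */
    struct example_old {
            int count;
            struct item entries[1];
    };

    /* after: a C99 flexible array member, which gives the compiler
     * (and UBSAN) the real bounds and silences the warnings */
    struct example_new {
            int count;
            struct item entries[];
    };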
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
This is generating far too much noise in the logs, so keep it at once
per boot until we (and other user-space tools) have adapted to the
kernel wanting user space to choose memfd execution behavior very
explicitly.
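Schematically, this amounts to demoting the rate-limited warning to a
once-per-boot one (sketch with a shortened message, not the exact
kernel string):

    /* before: still fires repeatedly, flooding the log */
    pr_warn_ratelimited("%s[%d]: memfd_create() without MFD_EXEC ...\n",
                        current->comm, task_pid_nr(current));

    /* after: logged a single time per boot */
    pr_warn_once("%s[%d]: memfd_create() without MFD_EXEC ...\n",
                 current->comm, task_pid_nr(current));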
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This improves compatibility for guests w.r.t. live migration, or live
snapshot rollback, to hosts supporting fewer (FPU) xfeatures, as long
as the set of features that was actually exposed to the guest is still
supported.
This improves on the ad856280ddea ("x86/kvm/fpu: Limit guest
user_xfeatures to supported bits of XCR0") bug fix.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This exposes the FLUSHBYASID CPU flag to nested VMs when running on
an AMD CPU. It also reverts a made-up check that would advertise
FLUSHBYASID as not supported. This enables certain modern hypervisors
such as VMware ESXi 7 and Workstation 17 to run nested VMs properly
again.
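Conceptually, the exposing part is a one-liner in KVM's SVM capability
setup (a sketch, assuming the usual kvm_cpu_cap helper):

    /* Advertise flush-by-ASID support to L1 guest hypervisors */
    kvm_cpu_cap_set(X86_FEATURE_FLUSHBYASID);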
Signed-off-by: Stefan Sterz <s.sterz@proxmox.com>
The latest amd64-microcode package in sid [0] (which probably will
eventually make it to bookworm-security) has a change that requires
the added patch to work properly.
The changelog entry refers to stable k.o branches only, but a quick
look through the linux-firmware.git log identifies
f2eb058afc57348cde66852272d6bf11da1eef8f as the relevant commit, which
refers (as a NOTE in the patch) to:
a32b0f0db3f3 ("x86/microcode/AMD: Load late on both threads too")
which applies cleanly (although I cherry-picked the patch from the
6.1.y stable branch to have the original commit in the commit
message).
Quickly tested by compiling and booting the result in a VM (however,
w/o a fitting CPU (Epyc Genoa or Bergamo) it should not cause a
change).
Reported in our enterprise support as a potential culprit for one
thread out of 128 being reported as offline in `lscpu`.
[0] https://metadata.ftp-master.debian.org/changelogs//non-free-firmware/a/amd64-microcode/amd64-microcode_3.20230808.1.1_changelog
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
Originally for v6.4-rc7, and by now it has also already landed in some
stable trees, but not yet in a (released) Ubuntu tag, so backport it
already.
Link: https://forum.proxmox.com/threads/133104/post-590457
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Avoids regressions where some code, e.g., ZFS, falsely thinks it
cannot use certain CPU features like AVX1.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
A user of ours reported an issue with p2p thunderbolt-net w.r.t. IPv6
and a failure to re-establish the connection after a reboot of a peer
node, in the forum [0], and then relayed it upstream, so let's
cherry-pick those two patches into our 6.2. Especially the IPv6 one
seems straightforward, and the other one makes the driver actually
conform to the spec and should only improve things.
[0]: https://forum.proxmox.com/threads/133104/
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
The mailing list thread [0] (found by Friedrich, many thanks!)
leading up to this patch sounds very similar to issues users reported
in the community forum [1] and the enterprise support channel, where a
VM would be stuck for no discernible reason with all vCPU threads
spinning.
[0]: https://lore.kernel.org/all/f023d927-52aa-7e08-2ee5-59a2fbc65953@gameservers.com/T/#u
[1]: https://forum.proxmox.com/threads/127459/
Suggested-by: Friedrich Weber <f.weber@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
While there is no actual issue, users are still nervous about the
faulty logging [0]. It might take a while until the fix comes in via
upstream, so just pick it up manually.
[0]: https://forum.proxmox.com/threads/130628/post-583864
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
There were several reports [0-2] about issues related to igc and tx
timeouts, and while the issue couldn't be reproduced locally, the hope
is that this fix Friedrich found will resolve the issue for the users.
The kernel versions in the reports match the time when 9b275176270e
("igc: Add ndo_tx_timeout support"), i.e. the commit fixed by this
one, landed.
[0]: https://forum.proxmox.com/threads/130935/
[1]: https://forum.proxmox.com/threads/130415/#post-580064
[2]: https://forum.proxmox.com/threads/132138/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
by cherry-picking the relevant commits from launchpad/lunar [0] (the
relevant commits are based on k.o/stable commits).
Minimally tested by booting my (Ryzen) machine with this kernel and
skimming through dmesg after boot.
[0] git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/lunar
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
The actual fix is the microcode update, but this is a stop-gap (with
a performance penalty) that sets a chicken bit on affected CPUs that
do not have a new enough microcode loaded, disabling some features.
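Such stop-gaps usually take the following shape (a generic sketch; the
MSR, bit, and revision names are placeholders rather than the ones
from the actual patch):

    /* If the loaded microcode is older than the fixed revision, set
     * the chicken bit that disables the affected feature, trading
     * performance for correctness. */
    if (c->microcode < FIXED_MICROCODE_REV)            /* placeholder */
            msr_set_bit(AFFECTED_MSR, CHICKEN_BIT_NR); /* placeholders */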
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
We got quite a few reports about this (e.g., in Bugzilla or [0]),
albeit in non-enterprise setups, as those cheap NVMes just don't
bother holding up basic principles...
[0]: https://forum.proxmox.com/threads/128738/#post-567249
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Fixes live-migration & snapshot-rollback of VMs with a restricted CPU
type (e.g., qemu64) from our 5.15 based kernel (default in Proxmox VE
7.4) to the 6.2 (and future newer) kernel of Proxmox VE 8.0.
Prior to (upstream kernel) commit ad856280ddea ("x86/kvm/fpu: Limit
guest user_xfeatures to supported bits of XCR0"), the PKRU bit of the
host could leak into the state from the guest, which caused trouble
when migrating between hosts with different CPUs, i.e., where the
source supported it but the target did not, causing a general
protection fault when the guest tried to use a PKRU-related
instruction after the migration.
But the fix, while welcome, caused a temporary out-of-sync state when
migrating such a VM from a kernel without the fix to a kernel with the
fix, as it threw off KVM when the guest's CPUID and most of the state
do not report XSAVE, and thus no xfeatures, but PKRU and the related
state are set as enabled, causing the vCPU to spin at 100% without any
progress, forever.
The fix could be made at two sites, either in QEMU or in the kernel. I
chose the kernel, as we have all the info there for a targeted
heuristic, so that we don't have to adapt QEMU and qemu-server, the
latter even on both sides.
Still, a short summary of the possible fixes and short drawbacks:
* on the QEMU side, either:
  - clear the PKRU state in the migration saved state; this would be
    rather complicated to implement, as the vCPU is initialised way
    before we have the saved xfeature state available to check what
    we'd need to do. Plus, user space only gets a memory blob from the
    KVM_GET_XSAVE2 ioctl that it passes to the KVM_SET_XSAVE ioctl;
    there are no ABI guarantees, and while the struct seems stable
    from 5.15 to 6.5-rc1, that doesn't have to hold for future
    kernels, so this is off the table.
  - enforce that the CPUID reports PKU support even if it normally
    wouldn't. While this works (tested by hard-coding it as a POC), it
    is a) not really nice and b) needs some interaction from
    qemu-server to enable this flag, as otherwise we have no good info
    to decide when it's OK to do this, which means we'd need to adapt
    both PVE 7's and PVE 8's qemu-server and also pve-qemu; workable
    but not optimal.
* on the kernel/KVM side, we can hook into the set-XSAVE ioctl
  specific to the KVM subsystem, which already reduces the chance of
  regression for all other places. There we have access to the
  union/struct definitions of the saved state and thus can safely cast
  to that. We also have access to the vCPU's CPUID capabilities,
  meaning we can check whether XCR0 (the first XSAVE Control Register)
  reports support for the PKRU feature; if it does *NOT*, but the
  saved xfeatures register from XSAVE *DOES* report it, we can safely
  assume that this combination is due to a migration from an older,
  leaky kernel, and clear the bit in the xfeatures register before
  restoring it to the guest vCPU KVM state (see the sketch below),
  avoiding the confusing situation that made the vCPU spin at 100%.
This should be safe to do, as the guest vCPU CPUID never reported
support for the PKRU feature, and it's also a relatively niche and
newish feature.
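As a sketch, the heuristic boils down to the following in the
KVM_SET_XSAVE ioctl path (simplified; names follow the usual KVM/x86
ones, not a verbatim copy of the patch):

    struct xregs_state *xsave = (struct xregs_state *)guest_xsave->region;

    /* The guest's CPUID (via its supported XCR0) never allowed PKRU,
     * yet the saved state from the old, leaky kernel claims PKRU state
     * is present: drop the bit before loading the state into the vCPU. */
    if (!(vcpu->arch.guest_supported_xcr0 & XFEATURE_MASK_PKRU) &&
        (xsave->header.xfeatures & XFEATURE_MASK_PKRU))
            xsave->header.xfeatures &= ~XFEATURE_MASK_PKRU;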
If it gains us something, we can drop this patch again in the future
Proxmox VE 9 major release, but we'd then have to ensure that VMs that
were started before PVE 8 cannot be directly live-migrated to the
release that includes that change; so we should rather only drop it if
the maintenance burden is high.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>