Read-Only source code mirror, Proxmox uses mailing list workflow for development.
Go to file
Thomas Lamprecht 1559d22f35 kvm: xsave set: mask-out PKRU bit in xfeatures if vCPU has no support
Fixes live-migrations & snapshot-rollback of VMs with a restricted
CPU type (e.g., qemu64) from our 5.15 based kernel (default Proxmox
VE 7.4) to the 6.2 (and future newer) of Proxmox VE 8.0.

Previous to (upstream kernel) commit ad856280ddea ("x86/kvm/fpu: Limit
guest user_xfeatures to supported bits of XCR0") the PKRU bit of the
host could leak into the state from the guest, which caused trouble
when migrating between hosts with different CPUs, i.e., where the
source supported it but the target did not, causing a general
protection fault when the guest tried to use a pkru related
instruction after the migration.

But the fix, while welcome, caused a temporary out-of-sync state when
migrating such a VM from a kernel without the fix to a kernel with
the fix, as it threw of KVM when the CPUID of the guest and most of
the state doesn't report XSAVE and thus any xfeatures, but PKRU and
the related state is set as enabled, causing the vCPU to spin at 100%
without any progress forever.

The fix could be at two sites, either in QEMU or in the kernel, I
choose the kernel as we have all the info there for a targeted
heuristic so that we don't have to adapt QEMU and qemu-server, the
latter even on both sides.

Still, a short summary of the possible fixes and short drawbacks:
* on QEMU-side either
  - clear the PKRU state in the migration saved state would be rather
    complicated to implement as the vCPU is initialised way before we
    have the saved xfeature state available to check what we'd need
    to do, plus the user-space only gets a memory blob from ioctl
    KVM_GET_XSAVE2 that it passes to KVM_SET_XSAVE ioctl, there are
    no ABI guarantees, and while the struct seem stable for 5.15 to
    6.5-rc1, that doesn't has to be for future kernels, so off the
    table.
  - enforce that the CPUID reports PKU support even if it normally
    wouldn't. While this works (tested by hard-coding it as POC) it
    is a) not really nice and b) needs some interaction from
    qemu-server to enable this flag as otherwise we have no good info
    to decide when it's OK to do this, which means we need to adapt
    both PVE 7 and 8's qemu-server and also pve-qemu, workable but
    not optimal

* on Kernel/KVM-side we can hook into the set XSAVE ioctl specific to
  the KVM subsystem, which already reduces chance of regression for
  all other places. There we have access to the union/struct
  definitions of the saved state and thus can savely cast to that.
  We also got access to the vCPU's CPUID capabilities, meaning we can
  check if the XCR0 (first XSAVE Control Register) reports
  that it support the PKRU feature, and if it does *NOT* but the
  saved xfeatures register from XSAVE *DOES* report it, we can safely
  assume that this combination is due to an migration from an older,
  leaky kernel – and clear the bit in the xfeature register before
  restoring it to the guest vCPU KVM state, avoiding the confusing
  situation that made the vCPU spin at 100%.
  This should be safe to do, as the guest vCPU CPUID never reported
  support for the PKRU feature, and it's also a relatively niche and
  newish feature.

If it gains us something we can drop this patch a bit in the future
Proxmox VE 9 major release, but we should ensure that VMs that where
started before PVE 8 cannot be directly live-migrated to the release
that includes that change; so we should rather only drop it if the
maintenance burden is high.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-07-14 19:47:11 +02:00
debian bump version to 6.2.16-4 2023-07-07 09:23:08 +02:00
patches/kernel kvm: xsave set: mask-out PKRU bit in xfeatures if vCPU has no support 2023-07-14 19:47:11 +02:00
submodules update submodule to Proxmox-6.2.16-3 2023-07-07 09:22:11 +02:00
.gitignore add *.prepared to .gitignore 2019-02-06 11:43:29 +01:00
.gitmodules change submodule url to ubuntu-kernel 2023-01-25 16:37:05 +01:00
abi-blacklist buildsys: simplify abi-check 2017-03-24 14:14:10 +01:00
abi-prev-6.2.16-4-pve update ABI file for 6.2.16-4-pve 2023-07-07 10:01:08 +02:00
fwlist-previous update fwlist for 6.2.6 2023-03-15 09:23:37 +01:00
Makefile bump version to 6.2.16-4 2023-07-07 09:23:08 +02:00
README readme: update for current source state 2022-12-13 18:07:40 +01:00

KERNEL SOURCE:
==============

We currently use the Ubuntu kernel sources, available from our mirror:

 https://git.proxmox.com/?p=mirror_ubuntu-kernels.git;a=summary

Ubuntu will maintain those kernels till:

 https://wiki.ubuntu.com/Kernel/Dev/ExtendedStable
 or
 https://pve.proxmox.com/pve-docs/chapter-pve-faq.html#faq-support-table

 whatever happens to be earlier.


Additional/Updated Modules:
---------------------------

- include native OpenZFS filesystem kernel modules for Linux

  * https://github.com/zfsonlinux/

  For licensing questions, see: http://open-zfs.org/wiki/Talk:FAQ


SUBMODULE
=========

We track the current upstream repository as submodule. Besides obvious
advantages over tracking binary tar archives this also has some implications.

For building the submodule directory gets copied into build/ and a few patches
get applied with the `patch` tool. From a git point-of-view, the copied
directory remains clean even with extra patches applied since it does not
contain a .git directory, but a reference to the (still pristine) submodule:

$ cat build/ubuntu-kernel/.git

If you mistakenly cloned the upstream repo as "normal" clone (not via the
submodule mechanics) this means that you have a real .git directory with its
independent objects and tracking info when copying for building, thus git
operates on the copied directory - and "sees" that it was dirtied by `patch`,
and thus the kernel buildsystem sees this too and will add a '+' to the version
as a result. This changes the output directories for modules and other build
artefacts and let's then the build fail on packaging.

So always ensure that you really checked it out as submodule, not as full
"normal" clone. You can also explicitly set the LOCALVERSION variable to
undefined with: `export LOCALVERSION= but that should only be done for test
builds.

RELATED PACKAGES:
=================

proxmox-ve
----------

top level meta package, depends on current default kernel series meta package.

git clone git://git.proxmox.com/git/proxmox-ve.git

pve-kernel-meta
---------------

Depends on latest kernel and header package within a certain kernel series,
e.g., pve-kernel-5.15 / pve-headers-5.15

git clone git://git.proxmox.com/git/pve-kernel-meta.git

pve-firmware
------------

Contains the firmware for all released PVE kernels.

git clone git://git.proxmox.com/git/pve-firmware.git


NOTES:
======

ABI versions, package versions and package name:
------------------------------------------------

We follow debian's versioning w.r.t ABI changes:

https://kernel-team.pages.debian.net/kernel-handbook/ch-versions.html
https://wiki.debian.org/DebianKernelABIChanges

The debian/rules file has a target comparing the build kernel's ABI against the
version stored in the repository and indicates when an ABI bump is necessary.
An ABI bump within one upstream version consists of incrementing the KREL
variable in the Makefile, rebuilding the packages and running 'make abiupdate'
(the 'abiupdate' target in 'Makefile' contains the steps for consistently
updating the repository).

Watchdog blacklist
------------------

By default, all watchdog modules are black-listed because it is totally undefined
which device is actually used for /dev/watchdog.
We ship this list in /lib/modprobe.d/blacklist_pve-kernel-<VERSION>.conf
The user typically edit /etc/modules to enable a specific watchdog device.

Debug kernel and modules
------------------------

In order to build a -dbgsym package containing an unstripped copy of the kernel
image and modules, enable the 'pkg.pve-kernel.debug' build profile (e.g. by
exporting DEB_BUILD_PROFILES='pkg.pve-kernel.debug'). The resulting package can
be used together with 'crash'/'kdump-tools' to debug kernel crashes.

Note: the -dbgsym package is only valid for the pve-kernel packages produced by
the same build. A kernel/module from a different build will likely not match,
even if both builds are of the same kernel and package version.

Additional information
----------------------

We use the default configuration provided by Ubuntu, and apply
the following modifications:

NOTE: For the exact and current list see debian/rules (PVE_CONFIG_OPTS)

- enable INTEL_MEI_WDT=m (to allow disabling via patch)

- disable CONFIG_SND_PCM_OSS (enabled by default in Ubuntu, not needed)

- switch CONFIG_TRANSPARENT_HUGEPAGE to MADVISE from ALWAYS

- enable CONFIG_CEPH_FS=m (request from user)

- enable common CONFIG_BLK_DEV_XXX to avoid hardware detection
  problems (udev, update-initramfs have serious problems without that)

  	 CONFIG_BLK_DEV_SD=y
  	 CONFIG_BLK_DEV_SR=y
  	 CONFIG_BLK_DEV_DM=y

- compile NBD and RBD modules
	 CONFIG_BLK_DEV_NBD=m
	 CONFIG_BLK_DEV_RBD=m

- enable IBM JFS file system as module
  requested by users (bug #64)

- enable apple HFS and HFSPLUS as module
  requested by users

- enable CONFIG_BCACHE=m (requested by user)

- enable CONFIG_BRIDGE=y
  to avoid warnings on boot, e.g. that net.bridge.bridge-nf-call-iptables is an unknown key

- enable CONFIG_DEFAULT_SECURITY_APPARMOR
  We need this for lxc

- set CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
  because if not set, it can give some dynamic memory or cpu frequencies 
  change, and vms can crash (mainly windows guest).
  see http://forum.proxmox.com/threads/18238-Windows-7-x64-VMs-crashing-randomly-during-process-termination?p=93273#post93273

- use 'deadline' as default scheduler
  This is the suggested setting for KVM. We also measure bad fsync performance with ext4 and cfq.

- disable CONFIG_INPUT_EVBUG
  Module evbug is not blacklisted on debian, so we simply disable it to avoid
  key-event logs (which is a big security problem)

- enable CONFIG_MODVERSIONS (needed for ABI tracking)

- switch default UNWINDER to FRAME_POINTER
  the recently introduced ORC_UNWINDER is not 100% stable yet, especially in combination with ZFS

- enable CONFIG_PAGE_TABLE_ISOLATION (Meltdown mitigation)