As reported in the community forum [0], doing a snapshot without
saving the VM state for a VM with a VirtIO block device with iothread
would lead to an assertion failure [1] and thus crash.
The issue is that vm_start() is called from the coroutine
qmp_savevm_end() which violates assumptions about graph locking down
the line. Factor out the part of qmp_savevm_end() that actually needs
to be a coroutine into a separate helper and turn qmp_savevm_end()
into a non-coroutine, so that it can call vm_start() safely.
The issue is likely not new, but was exposed by the recent graph
locking rework introducing stricter checks.
The issue does not occur when saving the VM state, because then the
non-coroutine process_savevm_finalize() will already call vm_start()
before qmp_savevm_end().
[0]: https://forum.proxmox.com/threads/149883/
[1]:
> #0 0x00007353e6096e2c __pthread_kill_implementation (libc.so.6 + 0x8ae2c)
> #1 0x00007353e6047fb2 __GI_raise (libc.so.6 + 0x3bfb2)
> #2 0x00007353e6032472 __GI_abort (libc.so.6 + 0x26472)
> #3 0x00007353e6032395 __assert_fail_base (libc.so.6 + 0x26395)
> #4 0x00007353e6040eb2 __GI___assert_fail (libc.so.6 + 0x34eb2)
> #5 0x0000592002307bb3 bdrv_graph_rdlock_main_loop (qemu-system-x86_64 + 0x83abb3)
> #6 0x00005920022da455 bdrv_change_aio_context (qemu-system-x86_64 + 0x80d455)
> #7 0x00005920022da6cb bdrv_try_change_aio_context (qemu-system-x86_64 + 0x80d6cb)
> #8 0x00005920022fe122 blk_set_aio_context (qemu-system-x86_64 + 0x831122)
> #9 0x00005920021b7b90 virtio_blk_start_ioeventfd (qemu-system-x86_64 + 0x6eab90)
> #10 0x0000592002022927 virtio_bus_start_ioeventfd (qemu-system-x86_64 + 0x555927)
> #11 0x0000592002066cc4 vm_state_notify (qemu-system-x86_64 + 0x599cc4)
> #12 0x000059200205d517 vm_prepare_start (qemu-system-x86_64 + 0x590517)
> #13 0x000059200205d56b vm_start (qemu-system-x86_64 + 0x59056b)
> #14 0x00005920020a43fd qmp_savevm_end (qemu-system-x86_64 + 0x5d73fd)
> #15 0x00005920023f3749 qmp_marshal_savevm_end (qemu-system-x86_64 + 0x926749)
> #16 0x000059200242f1d8 qmp_dispatch (qemu-system-x86_64 + 0x9621d8)
> #17 0x000059200238fa98 monitor_qmp_dispatch (qemu-system-x86_64 + 0x8c2a98)
> #18 0x000059200239044e monitor_qmp_dispatcher_co (qemu-system-x86_64 + 0x8c344e)
> #19 0x000059200245359b coroutine_trampoline (qemu-system-x86_64 + 0x98659b)
> #20 0x00007353e605d9c0 n/a (libc.so.6 + 0x519c0)
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
On ARM, gcc warns (-Werror=type-limits) that it will always be false
for the if statement. This is because here s->aid is defined as char,
while proxmox_restore_open_image() returns an int.
This is probably because chars are treated as unsigned on arm arch but
signed on x86 arch:
https://developer.arm.com/documentation/den0013/d/Porting/Miscellaneous-C-porting-issues/unsigned-char-and-signed-char
Make aid an explicit uint8_t, because that is the type for functions
taking the aid as a parameter, e.g. proxmox_restore_get_image_length().
Signed-off-by: Jing Luo <jing@jing.rocks>
[FE: slightly improve commit message]
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Most importantly the first one "Revert "monitor: use
aio_co_reschedule_self()"", fixing a crash when doing hotplug+resize
with a disk using io_uring.
Other fixes (likely not too important) for TCG emulation of x86(_64)
and ARM.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Most importantly, fix forwards and backwards migration with VirtIO-GPU
display.
Other fixes are for a regression in pflash device (introduced in 8.2)
and some fixes for x86(_64) TCG emulation. One of the patches needed
to be adapted, because it removed a helper that is still in use in
9.0.0.
There also is a revert for a fix in VirtIO PCI devices that turned out
to cause some issues, see the revert itself for more details.
Lastly, there is a change to move compatibility flags for a new
VirtIO-net feature to the correct machine type. The feature was
introduced in QEMU 8.2, but the compatibility flags got added to
machine version 8.0 instead of 8.1. This breaks backwards migration
with machine version 8.1 from a 8.2/9.0 binary to an 8.1 binary, in
cases where the guest kernel enables the feature (e.g. Ubuntu 23.10).
While that breaks migration with machine version 8.1 from an unpatched
to a patched binary, Proxmox VE only ever had 8.2 on the test
repository and 9.0 not yet in any public repository. An upstream
developer suggested it is the proper fix [0]. Upstream submission [1].
[0]: https://lore.kernel.org/qemu-devel/CACGkMEtZrJuhof+hUGVRvLLQE+8nQE5XmSHpT0NAQ1EpnqfmsA@mail.gmail.com/T/#u
[1]: https://lore.kernel.org/qemu-devel/20240517075336.104091-1-f.ebner@proxmox.com/T/#u
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
With fleecing, failure for copy-before-write does not fail the guest
write, but only sets the snapshot error that is associated to the
copy-before-write filter, making further requests to the snapshot
access fail with EACCES, which then also fails the job. But that error
code is not the root cause of why the backup failed, so bubble up the
original snapshot error instead.
Reported-by: Friedrich Weber <f.weber@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
The type for the copy-before-write timeout in nanoseconds was wrong.
By being just uint32_t, a maximum of slightly over 4 seconds was
possible. Larger values would overflow and thus the 45 seconds set by
Proxmox's backup with fleecing, resulted in effectively 2 seconds
timeout for copy-before-write operations.
Reported-by: Friedrich Weber <f.weber@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Biggest change is that AioContext locking got removed, but no changes
required other than dropping the calls to acquire and release it. As a
consequence, the single parameter for the bdrv_graph_wrlock() call got
removed which also required adaptation.
QAPI docs became stricter requiring to document all members.
Other minor changes:
- Single parameter from migration_is_running() was dropped.
- qemu_mutex_(un)lock_iothread() got renamed to bql_(un)lock().
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This version includes both the AioContext lock and the block graph
lock, so there might be some deadlocks lurking. It's not possible to
disable the block graph lock like was done in QEMU 8.1, because there
are no changes like the function bdrv_schedule_unref() that require
it. QEMU 9.0 will finally get rid of the AioContext locking.
During live-restore with a VirtIO SCSI drive with iothread there is a
known racy deadlock related to the AioContext lock. Not new [1], but
not sure if more likely now. Should be fixed in QEMU 9.0.
The block graph lock comes with annotations that can be checked by
clang's TSA. This required changes to the block drivers, i.e.
alloc-track, pbs, zeroinit as well as taking the appropriate locks
in pve-backup, savevm-async, vma-reader.
Local variable shadowing is prohibited via a compiler flag now,
required slight adaptation in vma.c.
Major changes only affect alloc-track:
* It is not possible to call a generated co-wrapper like
bdrv_get_info() while holding the block graph lock exclusively [0],
which does happen during initialization of alloc-track when the
backing hd is set and the refresh_limits driver callback is invoked.
The bdrv_get_info() call to get the cluster size is moved to
directly after opening the file child in track_open().
The important thing is that at least the request alignment for the
write target is used, because then the RMW cycle in bdrv_pwritev
will gather enough data from the backing file. Partial cluster
allocations in the target are not a fundamental issue, because the
driver returns its allocation status based on the bitmap, so any
other data that maps to the same cluster will still be copied later
by a stream job (or during writes to that cluster).
* Replacing the node cannot be done in the
track_co_change_backing_file() callback, because it is a coroutine
and cannot hold the block graph lock exclusively. So it is moved to
the stream job itself with the auto-remove option not having an
effect anymore (qemu-server would always set it anyways).
In the future, there could either be a special option for the stream
job, or maybe the upcoming blockdev-replace QMP command can be used.
Replacing the backing child is actually already done in the stream
job, so no need to do it in the track_co_change_backing_file()
callback. It also cannot be called from a coroutine. Looking at the
implementation in the qcow2 driver, it doesn't seem to be intended
to change the backing child itself, just update driver-internal
state.
Other changes:
* alloc-track: Error out early when used without auto-remove. Since
replacing the node now happens in the stream job, where the option
cannot be read from (it's internal to the driver), it will always be
treated as 'on'. Makes sure to have users beside qemu-server notice
the change (should they even exist). The option can be fully dropped
in the future while adding a version guard in qemu-server.
* alloc-track: Avoid seemingly superfluous child permission update.
Doesn't seem necessary nowadays (maybe after commit "alloc-track:
fix deadlock during drop" where the dropping is not rescheduled and
delayed anymore or some upstream change). Replacing the block node
will already update the permissions of the new node (which was the
file child before). Should there really be some issue, instead of
having a drop state, this could also be just based off the fact
whether there is still a backing child.
Dumping the cumulative (shared) permissions for the BDS with a debug
print yields the same values after this patch and with QEMU 8.1,
namely 3 and 5.
* PBS block driver: compile unconditionally. Proxmox VE always needs
it and something in the build process changed to make it not enabled
by default. Probably would need to move the build option to meson
otherwise.
* backup: job unreferencing during cleanup needs to happen outside of
coroutine, so it was moved to before invoking the clean
* mirror: Cherry-pick stable fix to avoid potential deadlock.
* savevm-async: migrate_init now can fail, so propagate potential
error.
* savevm-async: compression counters are not accessible outside
migration/ram-compress now, so drop code that prophylactically set
it to zero.
[0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Excerpt from Fiona's v3 cover-letter [0]:
When a backup for a VM is started, QEMU will install a
"copy-before-write" filter in its block layer. This filter ensures
that upon new guest writes, old data still needed for the backup is
sent to the backup target first. The guest write blocks until this
operation is finished so guest IO to not-yet-backed-up sectors will be
limited by the speed of the backup target.
With backup fleecing, such old data is cached in a fleecing image
rather than sent directly to the backup target. This can help guest IO
performance and even prevent hangs in certain scenarios, at the cost
of requiring more storage space.
With this series it will be possible to enable backup-fleecing via
e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing
images created on the storage `local-lvm`. The fleecing storage should
be a fast local storage which supports thin-provisioning and discard.
If the storage supports qcow2, that is used as the fleecing image
format. If the underlying file system does not support discard, with
qcow2 and preallocation=off, at least already allocated parts of the
image can be re-used later.
Fleecing images are created by qemu-server via pve-storage and
attached to QEMU before the backup starts, and cleaned up after the
backup finished or failed. The naming schema for fleecing images is
'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the
guest configuration, so that even after a hard failure, clean-up can
be re-attempted. While not too bad, it's a non-trivial amount of code
and I'm not 100% sure about the cost-benefit, so sending those as RFC.
The fleecing image needs to be the exact same size as the source, but
luckily, an explicit size can be specified when attaching a raw image
to QEMU so there are no size issues when using storages that have
coarser allocation/round up. For qcow2, it seems that virtual size can
be nearly arbitrary (i.e. modulo 512 byte granularity) during
allocation.
[0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html
Originally-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
With pvebackup_propagate_error(), the first error wins. When one job
in the transaction fails, it is expected that later jobs get the
ECANCELED error. Those are not interesting and by skipping them a more
interesting error, which is likely the actual root cause, can win.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
All callers of the function pass an address, so dereferencing once
before checking for NULL is required. It's also necessary to update
bytes and offset nevertheless, so the request will actually be aligned
later and not trigger an assertion failure.
Seems like this was accidentally broken in 8dca018 ("udpate and rebase
to QEMU v6.0.0") and this is effectively a revert to the original
version of the patch. The qiov functions changed back then, which
might've been the reason Stefan tried to simplify the patch.
Should fix live-import for certain kinds of VMDK images.
Reported-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Backported from commit bfa36802d1 ("virtio-blk: avoid using ioeventfd
state in irqfd conditional") because the rework/rename dataplane ->
ioeventfd didn't happen yet.
Reported in the community forum [0] and reproduced doing a backup loop
to PBS with suspend mode with fio doing heavy IO in the guest and
using an RBD storage (with krbd).
[0]: https://forum.proxmox.com/threads/141320
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
In many configurations, e.g. multiple vNICs with multiple queues or
with many Ceph OSDs, the default soft limit of 1024 is not enough.
QEMU is supposed to work fine with file descriptors >= 1024 and does
not use select() on POSIX. Bump the soft limit to the allowed hard
limit to avoid issues with the aforementioned configurations.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
This essentially repeats commit 6b7c181 ("add patch to work around
stuck guest IO with iothread and VirtIO block/SCSI") with an added
fix for the SCSI event virtqueue, which requires special handling.
This is to avoid the issue [3] that made the revert 2a49e66 ("Revert
"add patch to work around stuck guest IO with iothread and VirtIO
block/SCSI"") necessary the first time around.
When using iothread, after commits
1665d9326f ("virtio-blk: implement BlockDevOps->drained_begin()")
766aa2de0f ("virtio-scsi: implement BlockDevOps->drained_begin()")
it can happen that polling gets stuck when draining. This would cause
IO in the guest to get completely stuck.
A workaround for users is stopping and resuming the vCPUs because that
would also stop and resume the dataplanes which would kick the host
notifiers.
This can happen with block jobs like backup and drive mirror as well
as with hotplug [2].
Reports in the community forum that might be about this issue[0][1]
and there is also one in the enterprise support channel.
As a workaround in the code, just re-enable notifications and kick the
virt queue after draining. Draining is already costly and rare, so no
need to worry about a performance penalty here.
Take special care to attach the SCSI event virtqueue host notifier
with the _no_poll() variant like in virtio_scsi_dataplane_start().
This avoids the issue from the first attempted fix where the iothread
would suddenly loop with 100% CPU usage whenever some guest IO came in
[3]. This is necessary because of commit 38738f7dbb ("virtio-scsi:
don't waste CPU polling the event virtqueue"). See [4] for the
relevant discussion.
[0]: https://forum.proxmox.com/threads/137286/
[1]: https://forum.proxmox.com/threads/137536/
[2]: https://issues.redhat.com/browse/RHEL-3934
[3]: https://forum.proxmox.com/threads/138140/
[4]: https://lore.kernel.org/qemu-devel/bfc7b20c-2144-46e9-acbc-e726276c5a31@proxmox.com/
Link: https://lore.kernel.org/qemu-devel/20240202153158.788922-1-hreitz@redhat.com/
Originally-by: Fiona Ebner <f.ebner@proxmox.com>
[ TL: Update to v2 and rebased patch series handling to v8.1.5 ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Most notable fixes from a Proxmox VE perspective are:
* "virtio-net: correctly copy vnet header when flushing TX"
To prevent a stack overflow that could lead to leaking parts of the
QEMU process's memory.
* "hw/pflash: implement update buffer for block writes"
To prevent an edge case for half-completed writes. This potentially
affected EFI disks.
* Fixes to i386 emulation and ARM emulation.
No changes for patches were necessary (all are just automatic context
changes).
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
This reverts commit 6b7c1815e1.
The attempted fix has been reported to cause high CPU usage after
backup [0]. Not difficult to reproduce and it's iothreads getting
stuck in a loop. Downgrading to pve-qemu-kvm=8.1.2-4 helps which was
also verified by Christian, thanks! The issue this was supposed to fix
is much rarer, so revert for now, while upstream is still working on a
proper fix.
[0]: https://forum.proxmox.com/threads/138140/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
While the patch gives bdrv_graph_wrlock() as an example where the
issue can manifest, something similar can happen even when that is
disabled. Was able to reproduce the issue with
while true; do qm resize 115 scsi0 +4M; sleep 1; done
while running
fio --name=make-mirror-work --size=100M --direct=1 --rw=randwrite \
--bs=4k --ioengine=psync --numjobs=5 --runtime=1200 --time_based
in the VM.
Fix picked up from:
https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg01102.html
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
When using iothread, after commits
1665d9326f ("virtio-blk: implement BlockDevOps->drained_begin()")
766aa2de0f ("virtio-scsi: implement BlockDevOps->drained_begin()")
it can happen that polling gets stuck when draining. This would cause
IO in the guest to get completely stuck.
A workaround for users is stopping and resuming the vCPUs because that
would also stop and resume the dataplanes which would kick the host
notifiers.
This can happen with block jobs like backup and drive mirror as well
as with hotplug [2].
Reports in the community forum that might be about this issue[0][1]
and there is also one in the enterprise support channel.
As a workaround in the code, just re-enable notifications and kick the
virt queue after draining. Draining is already costly and rare, so no
need to worry about a performance penalty here. This was taken from
the following comment of a QEMU developer [3] (in my debugging,
I had already found re-enabling notification to work around the issue,
but also kicking the queue is more complete).
[0]: https://forum.proxmox.com/threads/137286/
[1]: https://forum.proxmox.com/threads/137536/
[2]: https://issues.redhat.com/browse/RHEL-3934
[3]: https://issues.redhat.com/browse/RHEL-3934?focusedId=23562096&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-23562096
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
This fixes the host->guest direction with noNVC as a client (and
likely others).
Reported-by: Friedrich Weber <f.weber@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Friedrich Weber <f.weber@proxmox.com>
The issue prevented FreeBSD 14 VMs with SATA disk from booting.
The commit it fixes e2a5d9b3d9c3 ("hw/ide/ahci: simplify and document
PxCI handling") is part of stable 8.1.2.
The patch was already applied to the block branch upstream:
https://lists.nongnu.org/archive/html/qemu-devel/2023-11/msg02711.html
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Friedrich Weber <f.weber@proxmox.com>
As reported in the community forum [0] and reproduced locally this
breaks VirtIO network adapters in (at least) the German ISO of Windows
Server 2022. The fix itself was for
> Issue is not fatal but as result acpi-index/"PCI Label ID" property
> is either not shown in device details page or shows incorrect value.
so revert and tolerate that as a stop-gap, rather than have the
devices not working at all.
[0]: https://forum.proxmox.com/threads/92094/post-605684
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
The implementation of the helper is_path_tmpfs() is similar to the
existing qemu_fd_getfs() function in util/mmap-alloc.c, which
unfortunately only takes an existing fd.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Seems to be required since commit 81e2b198a8 ("configure: create a
python venv unconditionally").
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Upstream QEMU commit 4271f40383 ("virtio-net: correctly report maximum
tx_queue_size value") made setting an invalid tx_queue_size for a
non-vDPA/vhost-user net device a hard error. Now, qemu-server before
commit 089aed81 ("cfg2cmd: netdev: fix value for tx_queue_size") did
just that, so the newer QEMU version would break start-up for most VMs
(a default vNIC configuration would be affected).
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Taking a snapshot became prohibitively slow because of the
migration_transferred_bytes() call in migration_rate_exceeded() [0].
This also applied to the async snapshot taking in Proxmox VE, so
work around the issue until it is fixed upstream.
[0]: https://gitlab.com/qemu-project/qemu/-/issues/1821
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Bigger notable changes:
* Commit 1a30b0f5d7 ("block: .bdrv_open is non-coroutine and
unlocked") broke the PVE backup patches, in particular setting up
the backup dump block driver, because bdrv_new_open_driver() cannot
be called from a coroutine. To fix it, bdrv_co_open() is used
instead, and while it's a much more involved function, the result
should be essentially the same. The only difference I noticed is
that the BDRV_O_ALLOW_RDWR flag is also set in the resulting bds
(block driver state), but that shouldn't hurt.
Smaller notable changes:
* aio_set_fd_handler() dropped its 'is_external' parameter stating
that all callers now pass false in 60f782b6b7 ("aio: remove
aio_disable_external() API"). The calls in the PVE patches also
passed false, so just drop the parameter too.
* global_state_store() does not have a return value anymore, so the
user in the PVE savevm-async patch was adapted. For context, see
c33f1829f8 ("migration: never fail in global_state_store()").
* Renames affecting the PVE savevm-async patch:
migrate_use_block() -> migrate_block() and ram_counters -> mig_stats
9d4b1e5f22 ("migration: Move migrate_use_block() to options.c")
aff3f6606d ("migration: Rename ram_counters to mig_stats")
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>