This version includes both the AioContext lock and the block graph
lock, so there might be some deadlocks lurking. It's not possible to
disable the block graph lock like was done in QEMU 8.1, because there
are no changes like the function bdrv_schedule_unref() that require
it. QEMU 9.0 will finally get rid of the AioContext locking.
During live-restore with a VirtIO SCSI drive with iothread there is a
known racy deadlock related to the AioContext lock. Not new [1], but
not sure if more likely now. Should be fixed in QEMU 9.0.
The block graph lock comes with annotations that can be checked by
clang's TSA. This required changes to the block drivers, i.e.
alloc-track, pbs, zeroinit as well as taking the appropriate locks
in pve-backup, savevm-async, vma-reader.
Local variable shadowing is prohibited via a compiler flag now,
required slight adaptation in vma.c.
Major changes only affect alloc-track:
* It is not possible to call a generated co-wrapper like
bdrv_get_info() while holding the block graph lock exclusively [0],
which does happen during initialization of alloc-track when the
backing hd is set and the refresh_limits driver callback is invoked.
The bdrv_get_info() call to get the cluster size is moved to
directly after opening the file child in track_open().
The important thing is that at least the request alignment for the
write target is used, because then the RMW cycle in bdrv_pwritev
will gather enough data from the backing file. Partial cluster
allocations in the target are not a fundamental issue, because the
driver returns its allocation status based on the bitmap, so any
other data that maps to the same cluster will still be copied later
by a stream job (or during writes to that cluster).
* Replacing the node cannot be done in the
track_co_change_backing_file() callback, because it is a coroutine
and cannot hold the block graph lock exclusively. So it is moved to
the stream job itself with the auto-remove option not having an
effect anymore (qemu-server would always set it anyways).
In the future, there could either be a special option for the stream
job, or maybe the upcoming blockdev-replace QMP command can be used.
Replacing the backing child is actually already done in the stream
job, so no need to do it in the track_co_change_backing_file()
callback. It also cannot be called from a coroutine. Looking at the
implementation in the qcow2 driver, it doesn't seem to be intended
to change the backing child itself, just update driver-internal
state.
Other changes:
* alloc-track: Error out early when used without auto-remove. Since
replacing the node now happens in the stream job, where the option
cannot be read from (it's internal to the driver), it will always be
treated as 'on'. Makes sure to have users beside qemu-server notice
the change (should they even exist). The option can be fully dropped
in the future while adding a version guard in qemu-server.
* alloc-track: Avoid seemingly superfluous child permission update.
Doesn't seem necessary nowadays (maybe after commit "alloc-track:
fix deadlock during drop" where the dropping is not rescheduled and
delayed anymore or some upstream change). Replacing the block node
will already update the permissions of the new node (which was the
file child before). Should there really be some issue, instead of
having a drop state, this could also be just based off the fact
whether there is still a backing child.
Dumping the cumulative (shared) permissions for the BDS with a debug
print yields the same values after this patch and with QEMU 8.1,
namely 3 and 5.
* PBS block driver: compile unconditionally. Proxmox VE always needs
it and something in the build process changed to make it not enabled
by default. Probably would need to move the build option to meson
otherwise.
* backup: job unreferencing during cleanup needs to happen outside of
coroutine, so it was moved to before invoking the clean
* mirror: Cherry-pick stable fix to avoid potential deadlock.
* savevm-async: migrate_init now can fail, so propagate potential
error.
* savevm-async: compression counters are not accessible outside
migration/ram-compress now, so drop code that prophylactically set
it to zero.
[0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Most notable fixes from a Proxmox VE perspective are:
* "virtio-net: correctly copy vnet header when flushing TX"
To prevent a stack overflow that could lead to leaking parts of the
QEMU process's memory.
* "hw/pflash: implement update buffer for block writes"
To prevent an edge case for half-completed writes. This potentially
affected EFI disks.
* Fixes to i386 emulation and ARM emulation.
No changes for patches were necessary (all are just automatic context
changes).
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Bigger notable changes:
* Commit 1a30b0f5d7 ("block: .bdrv_open is non-coroutine and
unlocked") broke the PVE backup patches, in particular setting up
the backup dump block driver, because bdrv_new_open_driver() cannot
be called from a coroutine. To fix it, bdrv_co_open() is used
instead, and while it's a much more involved function, the result
should be essentially the same. The only difference I noticed is
that the BDRV_O_ALLOW_RDWR flag is also set in the resulting bds
(block driver state), but that shouldn't hurt.
Smaller notable changes:
* aio_set_fd_handler() dropped its 'is_external' parameter stating
that all callers now pass false in 60f782b6b7 ("aio: remove
aio_disable_external() API"). The calls in the PVE patches also
passed false, so just drop the parameter too.
* global_state_store() does not have a return value anymore, so the
user in the PVE savevm-async patch was adapted. For context, see
c33f1829f8 ("migration: never fail in global_state_store()").
* Renames affecting the PVE savevm-async patch:
migrate_use_block() -> migrate_block() and ram_counters -> mig_stats
9d4b1e5f22 ("migration: Move migrate_use_block() to options.c")
aff3f6606d ("migration: Rename ram_counters to mig_stats")
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Since upstream QEMU 8.0, it's no longer possible to call
bdrv_img_create() from a coroutine anymore, meaning a backup with the
directory format would crash the QEMU instance.
The feature is only exposed via the monitor and was intended to be
experimental. There were no user reports about the breakage and it
only was noticed during the rebase for QEMU 8.1, because other parts
of the backup code needed adaptation and I decided to check the
BACKUP_FORMAT_DIR case too.
It should not stay in a broken state of course, but avoid the
maintenance cost and just make it a removed feature for Proxmox VE 8
retroactively.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Notable changes:
* The only big change is the switch to using a custom QIOChannel for
savevm-async, because the previously used QEMUFileOps was dropped.
Changes to the current implementation:
* Switch to vector based methods as required for an IO channel. For
short reads the passed-in IO vector is stuffed with zeroes at the
end, just to be sure.
* For reading: The documentation in include/io/channel.h states that
at least one byte should be read, so also error out when whe are
at the very end instead of returning 0.
* For reading: Fix off-by-one error when request goes beyond end.
The wrong code piece was:
if ((pos + size) > maxlen) {
size = maxlen - pos - 1;
}
Previously, the last byte would not be read. It's actually
possible to get a snapshot .raw file that has content all the way
up the final 512 byte (= BDRV_SECTOR_SIZE) boundary without any
trailing zero bytes (I wrote a script to do it).
Luckily, it didn't cause a real issue, because qemu_loadvm_state()
is not interested in the final (i.e. QEMU_VM_VMDESCRIPTION)
section. The buffer for reading it is simply freed up afterwards
and the function will assume that it read the whole section, even
if that's not the case.
* For writing: Make use of the generated blk_pwritev() wrapper
instead of manually wrapping the coroutine to simplify and save a
few lines.
* Adapt to changed interfaces for blk_{pread,pwrite}:
* a9262f551e ("block: Change blk_{pread,pwrite}() param order")
* 3b35d4542c ("block: Add a 'flags' param to blk_pread()")
* bf5b16fa40 ("block: Make blk_{pread,pwrite}() return 0 on success")
Those changes especially affected the qemu-img dd patches, because
the context also changed, but also some of our block drivers used
the functions.
* Drop qemu-common.h include: it got renamed after essentially
everything was moved to other headers. The only remaining user I
could find for things dropped from the header between 7.0 and 7.1
was qemu_get_vm_name() in the iscsi-initiatorname patch, but it
already includes the header to which the function was moved.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Only very minor changes needed:
* Most patches in extra (or some version of them) are part of 7.0.0.
* aio_set_fd_handler got an extra parameter, but can just pass NULL
like we did for the related 'poll' parameter. See QEMU commit
826cc32423db2a99d184dbf4f507c737d7e7a4ae for more.
* Add include for qemu/memalign.h in vma.c and vma-writer.c.
* Add reverts for fixups of already reverted 0347a8fd4c ("block/rbd:
implement bdrv_co_block_status") that came in with 7.0.0. Those
fixups are not enough, see Proxmox bugzilla #4047.
* Two trivial context changes for bitmap-mirror patches.
* block_int.h got split up into multiple headers.
* Some context changes in configure and meson.build.
* Used the oppurtunity to squash fixup of bdrv_backuo_dump_create typo
in a later patch into the patch introducing the function (had to
move code to new header during rebase).
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
This is necessary for multi-disk backups where not all jobs are
immediately started after they are created. QEMU commit
06e0a9c16405c0a4c1eca33cf286cc04c42066a2 did already part of the work,
ensuring that new writes after job creation don't pass through to the
backup, but not yet for the MIRROR_SYNC_MODE_BITMAP case which is used
for PBS.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
This drops debian/patches/pve/0005-PVE-Config-smm_available-false.patch
(and renumbers the remaining patches)
From what I could gather, this patch was originally added
due to issues with old kernels. Now we have users which
seem to run into issues *with* the patch.
All this does is toggle an option, and it's available via a
qemu CLI option anyway, so if dropping this patch causes
issues for some people we can just add an option to
qemu-server & UI control smm explicitly.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Cc: Alexandre Derumier <aderumier@odiso.com>
Tested-by: Stefan Reiter <s.reiter@proxmox.com>
Mostly minor changes, bigger ones summarized:
* QEMU's internal backup code now uses a new async system, which allows
parallel requests - the default max_workers settings is 64, I chose
less, since 64 put enough stress on QEMU that the guest became
practically unusable during the backup, and 16 still shows quite a
nice measureable performance improvement. Little code changes for us
though.
* 'malformed' QAPI parameters/functions are now a build error (i.e.
using '_' vs '-'), I chose to just whitelist our calls in the name of
backwards compatibility.
* monitor OOB race fix now uses the upstream variant, cherry-picked from
origin/master since it's not in 6.0 by default
* last patch fixes a bug with snapshot rollback related to the new yank
system
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
...instead of having them in the middle of the backup related patches.
These might (hopefully) become upstream at some point as well.
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>