From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Dietmar Maurer <dietmar@proxmox.com>
Date: Mon, 6 Apr 2020 12:16:46 +0200
Subject: [PATCH] PVE: add savevm-async for background state snapshots

Put qemu_savevm_state_{header,setup} into the main loop and the rest
of the iteration into a coroutine. The former need to lock the
iothread (and we can't unlock it in the coroutine), and the latter
can't deal with being in a separate thread, so a coroutine it must
be.

Truncate output file at 1024 boundary.

Do not block the VM and save the state on aborting a snapshot, as the
snapshot will be invalid anyway.

Also, when aborting, wait for the target file to be closed, otherwise a
client might run into race-conditions when trying to remove the file
still opened by QEMU.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
squash related patches
where there is no good reason to keep them separate. It's a pain
during rebase if there are multiple patches changing the same code
over and over again. This was especially bad for the backup-related
patches. If the history of patches really is needed, it can be
extracted via git. Additionally, compilation with partial application
of patches has been broken for a long time, because one of the master key
changes became part of an earlier patch during a past rebase.
If only the same files were changed by a subsequent patch and the
changes seemed to belong together (obvious for later bug fixes, but also
done for features e.g. adding master key support for PBS), the patches
were squashed together.
The PBS namespace support patch was split into the individual parts
it changes, i.e. PBS block driver, pbs-restore binary and QMP backup
infrastructure, and squashed into the respective patches.
No code change is intended, git diff in the submodule should not show
any difference between applying all patches before this commit and
applying all patches after this commit.
The query-proxmox-support QMP function has been left as part of the
"PVE-Backup: Proxmox backup patches for QEMU" patch, because it's
currently only used there. If it ever is used elsewhere too, it can
be split out from there.
The recent alloc-track and BQL-related savevm-async changes have been
left separate for now, because it's not 100% clear they are the best
approach yet. This depends on what upstream decides about the BQL
stuff and whether and what kind of issues with the changes pop up.
The qemu-img dd snapshot patch has been re-ordered to after the other
qemu-img dd patches.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
[SR: improve aborting
register yank before migration_incoming_state_destroy]
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
update submodule and patches to 7.1.0
Notable changes:
* The only big change is the switch to using a custom QIOChannel for
savevm-async, because the previously used QEMUFileOps was dropped.
Changes to the current implementation:
* Switch to vector based methods as required for an IO channel. For
short reads the passed-in IO vector is stuffed with zeroes at the
end, just to be sure.
* For reading: The documentation in include/io/channel.h states that
at least one byte should be read, so also error out when we are
at the very end instead of returning 0.
* For reading: Fix off-by-one error when request goes beyond end.
The wrong code piece was:
if ((pos + size) > maxlen) {
size = maxlen - pos - 1;
}
Previously, the last byte would not be read. It's actually
possible to get a snapshot .raw file that has content all the way
up to the final 512-byte (= BDRV_SECTOR_SIZE) boundary without any
trailing zero bytes (I wrote a script to do it).
Luckily, it didn't cause a real issue, because qemu_loadvm_state()
is not interested in the final (i.e. QEMU_VM_VMDESCRIPTION)
section. The buffer for reading it is simply freed up afterwards
and the function will assume that it read the whole section, even
if that's not the case.
* For writing: Make use of the generated blk_pwritev() wrapper
instead of manually wrapping the coroutine to simplify and save a
few lines.
* Adapt to changed interfaces for blk_{pread,pwrite}:
* a9262f551e ("block: Change blk_{pread,pwrite}() param order")
* 3b35d4542c ("block: Add a 'flags' param to blk_pread()")
* bf5b16fa40 ("block: Make blk_{pread,pwrite}() return 0 on success")
Those changes especially affected the qemu-img dd patches, because
the context also changed, but also some of our block drivers used
the functions.
* Drop qemu-common.h include: it got renamed after essentially
everything was moved to other headers. The only remaining user I
could find for things dropped from the header between 7.0 and 7.1
was qemu_get_vm_name() in the iscsi-initiatorname patch, but it
already includes the header to which the function was moved.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
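The off-by-one described above can be illustrated with a minimal, standalone C sketch. The helper names clamp_read_buggy() and clamp_read_fixed() are hypothetical and do not appear in savevm-async.c; the first mirrors the quoted faulty snippet, the second shows one possible corrected clamp, assuming that pos is the read offset and maxlen the total number of valid bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the faulty snippet quoted in the commit
 * message: when the request crosses the end, one byte too few is kept.
 * Only meaningful for pos < maxlen.
 */
static size_t clamp_read_buggy(uint64_t pos, size_t size, uint64_t maxlen)
{
    if ((pos + size) > maxlen) {
        size = (size_t)(maxlen - pos - 1);
    }
    return size;
}

/*
 * Hypothetical corrected clamp: a request that crosses the end is cut
 * down to exactly the remaining bytes, including the final one.
 */
static size_t clamp_read_fixed(uint64_t pos, size_t size, uint64_t maxlen)
{
    if (pos >= maxlen) {
        /* Nothing left; an IO channel read would report an error here. */
        return 0;
    }
    if (size > maxlen - pos) {
        size = (size_t)(maxlen - pos);
    }
    return size;
}
```

For a 512-byte file and an 8-byte request at offset 508, the faulty clamp yields 3 bytes while the corrected one yields the 4 bytes that actually remain.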
[FE: further improve aborting
adapt to removal of QEMUFileOps
update submodule and patches to QEMU 8.0.0
Many changes were necessary this time around:
* QAPI was changed to avoid redundant has_* variables, see commit
44ea9d9be3 ("qapi: Start to elide redundant has_FOO in generated C")
for details. This affected many QMP commands added by Proxmox too.
* Pending querying for migration got split into two functions, one to
estimate, one for exact value, see commit c8df4a7aef ("migration:
Split save_live_pending() into state_pending_*") for details. Relevant
for savevm-async and PBS dirty bitmap.
* Some block (driver) functions got converted to coroutines, so the
Proxmox block drivers needed to be adapted.
* Alloc track auto-detaching during PBS live restore got broken by
AioContext-related changes resulting in a deadlock. The current, hacky
method was replaced by a simpler one. Stefan apparently ran into a
problem with that when he wrote the driver, but there were
improvements in the stream job code since then and I didn't manage to
reproduce the issue. It's a separate patch "alloc-track: fix deadlock
during drop" for now, you can find the details there.
* Async snapshot-related changes:
- The pending querying got adapted to the above-mentioned split and
a patch is added to optimize it/make it more similar to what
upstream code does.
- Added initialization of the compression counters (for
future-proofing).
- It's necessary to hold the BQL (big QEMU lock = iothread mutex)
during the setup phase, because block layer functions are used there
and not doing so leads to racy, hard-to-debug crashes or hangs. It's
necessary to change some upstream code too for this, a version of
the patch "migration: for snapshots, hold the BQL during setup
callbacks" is intended to be upstreamed.
- Need to take the bdrv graph read lock before flushing.
* hmp_info_balloon was moved to a different file.
* Needed to include new headers from time to time to still get the
correct functions.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
improve condition for entering final stage
update submodule and patches to QEMU 8.2.2
This version includes both the AioContext lock and the block graph
lock, so there might be some deadlocks lurking. It's not possible to
disable the block graph lock as was done in QEMU 8.1, because there
are no changes like the function bdrv_schedule_unref() that require
it. QEMU 9.0 will finally get rid of the AioContext locking.
During live-restore with a VirtIO SCSI drive with iothread there is a
known racy deadlock related to the AioContext lock. Not new [1], but
not sure if more likely now. Should be fixed in QEMU 9.0.
The block graph lock comes with annotations that can be checked by
clang's TSA. This required changes to the block drivers, i.e.
alloc-track, pbs, zeroinit as well as taking the appropriate locks
in pve-backup, savevm-async, vma-reader.
Local variable shadowing is prohibited via a compiler flag now,
required slight adaptation in vma.c.
Major changes only affect alloc-track:
* It is not possible to call a generated co-wrapper like
bdrv_get_info() while holding the block graph lock exclusively [0],
which does happen during initialization of alloc-track when the
backing hd is set and the refresh_limits driver callback is invoked.
The bdrv_get_info() call to get the cluster size is moved to
directly after opening the file child in track_open().
The important thing is that at least the request alignment for the
write target is used, because then the RMW cycle in bdrv_pwritev
will gather enough data from the backing file. Partial cluster
allocations in the target are not a fundamental issue, because the
driver returns its allocation status based on the bitmap, so any
other data that maps to the same cluster will still be copied later
by a stream job (or during writes to that cluster).
* Replacing the node cannot be done in the
track_co_change_backing_file() callback, because it is a coroutine
and cannot hold the block graph lock exclusively. So it is moved to
the stream job itself with the auto-remove option not having an
effect anymore (qemu-server would always set it anyways).
In the future, there could either be a special option for the stream
job, or maybe the upcoming blockdev-replace QMP command can be used.
Replacing the backing child is actually already done in the stream
job, so no need to do it in the track_co_change_backing_file()
callback. It also cannot be called from a coroutine. Looking at the
implementation in the qcow2 driver, it doesn't seem to be intended
to change the backing child itself, just update driver-internal
state.
Other changes:
* alloc-track: Error out early when used without auto-remove. Since
replacing the node now happens in the stream job, where the option
cannot be read from (it's internal to the driver), it will always be
treated as 'on'. This makes sure that users besides qemu-server notice
the change (should they even exist). The option can be fully dropped
in the future while adding a version guard in qemu-server.
* alloc-track: Avoid seemingly superfluous child permission update.
Doesn't seem necessary nowadays (maybe after commit "alloc-track:
fix deadlock during drop" where the dropping is not rescheduled and
delayed anymore or some upstream change). Replacing the block node
will already update the permissions of the new node (which was the
file child before). Should there really be some issue, instead of
having a drop state, this could also be just based off the fact
whether there is still a backing child.
Dumping the cumulative (shared) permissions for the BDS with a debug
print yields the same values after this patch and with QEMU 8.1,
namely 3 and 5.
* PBS block driver: compile unconditionally. Proxmox VE always needs
it and something in the build process changed to make it not enabled
by default. Probably would need to move the build option to meson
otherwise.
* backup: job unreferencing during cleanup needs to happen outside of
coroutine, so it was moved to before invoking the clean
* mirror: Cherry-pick stable fix to avoid potential deadlock.
* savevm-async: migrate_init now can fail, so propagate potential
error.
* savevm-async: compression counters are not accessible outside
migration/ram-compress now, so drop code that prophylactically set
it to zero.
[0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
adapt to QAPI and other changes for 8.2]
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 hmp-commands-info.hx         |  13 +
 hmp-commands.hx              |  17 ++
 include/migration/snapshot.h |   2 +
 include/monitor/hmp.h        |   3 +
 migration/meson.build        |   1 +
 migration/savevm-async.c     | 534 +++++++++++++++++++++++++++++++++++
 monitor/hmp-cmds.c           |  38 +++
 qapi/migration.json          |  34 +++
 qapi/misc.json               |  16 ++
 qemu-options.hx              |  12 +
 system/vl.c                  |  10 +
 11 files changed, 680 insertions(+)
 create mode 100644 migration/savevm-async.c

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index f5b37eb74a..10fdd822e0 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -525,6 +525,19 @@ SRST
     Show current migration parameters.
 ERST
 
+    {
+        .name       = "savevm",
+        .args_type  = "",
+        .params     = "",
+        .help       = "show savevm status",
+        .cmd        = hmp_info_savevm,
+    },
+
+SRST
+  ``info savevm``
+    Show savevm status.
+ERST
+
     {
         .name       = "balloon",
         .args_type  = "",
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 765349ed14..893c3bd240 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
update submodule and patches to QEMU 8.2.2
This version includes both the AioContext lock and the block graph
lock, so there might be some deadlocks lurking. It's not possible to
disable the block graph lock like was done in QEMU 8.1, because there
are no changes like the function bdrv_schedule_unref() that require
it. QEMU 9.0 will finally get rid of the AioContext locking.
During live-restore with a VirtIO SCSI drive with iothread there is a
known racy deadlock related to the AioContext lock. Not new [1], but
not sure if more likely now. Should be fixed in QEMU 9.0.
The block graph lock comes with annotations that can be checked by
clang's TSA. This required changes to the block drivers, i.e.
alloc-track, pbs, zeroinit as well as taking the appropriate locks
in pve-backup, savevm-async, vma-reader.
Local variable shadowing is prohibited via a compiler flag now,
required slight adaptation in vma.c.
Major changes only affect alloc-track:
* It is not possible to call a generated co-wrapper like
bdrv_get_info() while holding the block graph lock exclusively [0],
which does happen during initialization of alloc-track when the
backing hd is set and the refresh_limits driver callback is invoked.
The bdrv_get_info() call to get the cluster size is moved to
directly after opening the file child in track_open().
The important thing is that at least the request alignment for the
write target is used, because then the RMW cycle in bdrv_pwritev
will gather enough data from the backing file. Partial cluster
allocations in the target are not a fundamental issue, because the
driver returns its allocation status based on the bitmap, so any
other data that maps to the same cluster will still be copied later
by a stream job (or during writes to that cluster).
* Replacing the node cannot be done in the
track_co_change_backing_file() callback, because it is a coroutine
and cannot hold the block graph lock exclusively. So it is moved to
the stream job itself with the auto-remove option not having an
effect anymore (qemu-server would always set it anyways).
In the future, there could either be a special option for the stream
job, or maybe the upcoming blockdev-replace QMP command can be used.
Replacing the backing child is actually already done in the stream
job, so no need to do it in the track_co_change_backing_file()
callback. It also cannot be called from a coroutine. Looking at the
implementation in the qcow2 driver, it doesn't seem to be intended
to change the backing child itself, just update driver-internal
state.
Other changes:
* alloc-track: Error out early when used without auto-remove. Since
replacing the node now happens in the stream job, where the option
cannot be read from (it's internal to the driver), it will always be
treated as 'on'. This ensures that users besides qemu-server notice
the change (should they even exist). The option can be fully dropped
in the future while adding a version guard in qemu-server.
* alloc-track: Avoid seemingly superfluous child permission update.
Doesn't seem necessary nowadays (maybe after commit "alloc-track:
fix deadlock during drop" where the dropping is not rescheduled and
delayed anymore or some upstream change). Replacing the block node
will already update the permissions of the new node (which was the
file child before). Should there really be some issue, instead of
having a drop state, this could also be just based off the fact
whether there is still a backing child.
Dumping the cumulative (shared) permissions for the BDS with a debug
print yields the same values after this patch and with QEMU 8.1,
namely 3 and 5.
* PBS block driver: compile unconditionally. Proxmox VE always needs
it and something in the build process changed to make it not enabled
by default. Probably would need to move the build option to meson
otherwise.
* backup: job unreferencing during cleanup needs to happen outside of
coroutine, so it was moved to before invoking the clean
* mirror: Cherry-pick stable fix to avoid potential deadlock.
* savevm-async: migrate_init now can fail, so propagate potential
error.
* savevm-async: compression counters are not accessible outside
migration/ram-compress now, so drop code that prophylactically set
them to zero.
[0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
|
|
|
@@ -1875,3 +1875,20 @@ SRST
|
update submodule and patches to QEMU 8.0.0
Many changes were necessary this time around:
* QAPI was changed to avoid redundant has_* variables, see commit
44ea9d9be3 ("qapi: Start to elide redundant has_FOO in generated C")
for details. This affected many QMP commands added by Proxmox too.
* Pending querying for migration got split into two functions, one to
estimate, one for exact value, see commit c8df4a7aef ("migration:
Split save_live_pending() into state_pending_*") for details. Relevant
for savevm-async and PBS dirty bitmap.
* Some block (driver) functions got converted to coroutines, so the
Proxmox block drivers needed to be adapted.
* Alloc track auto-detaching during PBS live restore got broken by
AioContext-related changes resulting in a deadlock. The current, hacky
method was replaced by a simpler one. Stefan apparently ran into a
problem with that when he wrote the driver, but there were
improvements in the stream job code since then and I didn't manage to
reproduce the issue. It's a separate patch "alloc-track: fix deadlock
during drop" for now, you can find the details there.
* Async snapshot-related changes:
- The pending querying got adapted to the above-mentioned split and
a patch is added to optimize it/make it more similar to what
upstream code does.
- Added initialization of the compression counters (for
future-proofing).
- It's necessary to hold the BQL (big QEMU lock = iothread mutex)
during the setup phase, because block layer functions are used there
and not doing so leads to racy, hard-to-debug crashes or hangs. This
also requires changing some upstream code; a version of
the patch "migration: for snapshots, hold the BQL during setup
callbacks" is intended to be upstreamed.
- Need to take the bdrv graph read lock before flushing.
* hmp_info_balloon was moved to a different file.
* Needed to include new headers from time to time to still get the
correct functions.
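As a rough illustration of the has_* elision mentioned in the first item, the sketch below contrasts the two styles for an optional string argument. The struct and function names (OldSavevmArgs, NewSavevmArgs, old_statefile_or, new_statefile_or) are hypothetical and not taken from the generated QAPI code; the only assumption carried over from the referenced commit is that an absent pointer-typed optional member is represented by NULL instead of a separate boolean.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pre-8.0 style: the optional string comes with its own flag. */
typedef struct OldSavevmArgs {
    bool has_statefile;
    const char *statefile;
} OldSavevmArgs;

/* Hypothetical 8.0 style: a NULL pointer alone means "not given". */
typedef struct NewSavevmArgs {
    const char *statefile;
} NewSavevmArgs;

/* Old style: consult the flag before touching the pointer. */
static const char *old_statefile_or(const OldSavevmArgs *a,
                                    const char *fallback)
{
    return a->has_statefile ? a->statefile : fallback;
}

/* New style: the pointer itself carries the presence information. */
static const char *new_statefile_or(const NewSavevmArgs *a,
                                    const char *fallback)
{
    return a->statefile ? a->statefile : fallback;
}
```

Callers that previously set or checked the flag only need to pass or test the pointer, which is why many of the added QMP commands had to be touched.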
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-05-15 16:39:53 +03:00
|
|
|
List event channels in the guest
|
2022-12-14 17:16:32 +03:00
|
|
|
ERST
|
|
|
|
#endif
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
|
|
|
+ {
|
|
|
|
+ .name = "savevm-start",
|
|
|
|
+ .args_type = "statefile:s?",
|
|
|
|
+ .params = "[statefile]",
|
|
|
|
+ .help = "Prepare for snapshot and halt VM. Save VM state to statefile.",
|
2017-04-05 12:38:26 +03:00
|
|
|
+ .cmd = hmp_savevm_start,
|
2017-04-05 11:49:19 +03:00
|
|
|
+ },
|
|
|
|
+
|
|
|
|
+ {
|
|
|
|
+ .name = "savevm-end",
|
|
|
|
+ .args_type = "",
|
|
|
|
+ .params = "",
|
|
|
|
+ .help = "Resume VM after snapshot.",
|
2021-02-11 19:11:11 +03:00
|
|
|
+ .cmd = hmp_savevm_end,
|
|
|
|
+ .coroutine = true,
|
2017-04-05 11:49:19 +03:00
|
|
|
+ },
|
2019-11-20 17:45:35 +03:00
|
|
|
diff --git a/include/migration/snapshot.h b/include/migration/snapshot.h
|
2021-05-27 13:43:32 +03:00
|
|
|
index e72083b117..c846d37806 100644
|
2019-11-20 17:45:35 +03:00
|
|
|
--- a/include/migration/snapshot.h
|
|
|
|
+++ b/include/migration/snapshot.h
|
2021-05-27 13:43:32 +03:00
|
|
|
@@ -61,4 +61,6 @@ bool delete_snapshot(const char *name,
|
|
|
|
bool has_devices, strList *devices,
|
|
|
|
Error **errp);
|
2019-11-20 17:45:35 +03:00
|
|
|
|
|
|
|
+int load_snapshot_from_blockdev(const char *filename, Error **errp);
|
2021-05-27 13:43:32 +03:00
|
|
|
+
|
2019-11-20 17:45:35 +03:00
|
|
|
#endif
|
|
|
|
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
|
2023-10-17 15:10:09 +03:00
|
|
|
index 13f9a2dedb..7a7def7530 100644
|
2019-11-20 17:45:35 +03:00
|
|
|
--- a/include/monitor/hmp.h
|
|
|
|
+++ b/include/monitor/hmp.h
|
2023-05-15 16:39:53 +03:00
|
|
|
@@ -28,6 +28,7 @@ void hmp_info_status(Monitor *mon, const QDict *qdict);
|
2019-11-20 17:45:35 +03:00
|
|
|
void hmp_info_uuid(Monitor *mon, const QDict *qdict);
|
|
|
|
void hmp_info_chardev(Monitor *mon, const QDict *qdict);
|
|
|
|
void hmp_info_mice(Monitor *mon, const QDict *qdict);
|
|
|
|
+void hmp_info_savevm(Monitor *mon, const QDict *qdict);
|
|
|
|
void hmp_info_migrate(Monitor *mon, const QDict *qdict);
|
|
|
|
void hmp_info_migrate_capabilities(Monitor *mon, const QDict *qdict);
|
|
|
|
void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict);
|
2023-05-24 16:56:53 +03:00
|
|
|
@@ -94,6 +95,8 @@ void hmp_closefd(Monitor *mon, const QDict *qdict);
|
2023-05-15 16:39:53 +03:00
|
|
|
void hmp_mouse_move(Monitor *mon, const QDict *qdict);
|
|
|
|
void hmp_mouse_button(Monitor *mon, const QDict *qdict);
|
|
|
|
void hmp_mouse_set(Monitor *mon, const QDict *qdict);
|
2019-11-20 17:45:35 +03:00
|
|
|
+void hmp_savevm_start(Monitor *mon, const QDict *qdict);
|
|
|
|
+void hmp_savevm_end(Monitor *mon, const QDict *qdict);
|
|
|
|
void hmp_sendkey(Monitor *mon, const QDict *qdict);
|
2022-12-14 17:16:32 +03:00
|
|
|
void coroutine_fn hmp_screendump(Monitor *mon, const QDict *qdict);
|
2020-04-07 17:53:19 +03:00
|
|
|
void hmp_chardev_add(Monitor *mon, const QDict *qdict);
|
2021-02-11 19:11:11 +03:00
|
|
|
diff --git a/migration/meson.build b/migration/meson.build
|
2024-04-25 18:21:28 +03:00
|
|
|
index 0e689eac09..8f9d122187 100644
|
2021-02-11 19:11:11 +03:00
|
|
|
--- a/migration/meson.build
|
|
|
|
+++ b/migration/meson.build
|
2024-04-25 18:21:28 +03:00
|
|
|
@@ -27,6 +27,7 @@ system_ss.add(files(
|
2023-10-17 15:10:09 +03:00
|
|
|
'options.c',
|
2021-02-11 19:11:11 +03:00
|
|
|
'postcopy-ram.c',
|
|
|
|
'savevm.c',
|
|
|
|
+ 'savevm-async.c',
|
|
|
|
'socket.c',
|
|
|
|
'tls.c',
|
2023-05-15 16:39:53 +03:00
|
|
|
'threadinfo.c',
|
2021-02-11 19:11:11 +03:00
|
|
|
diff --git a/migration/savevm-async.c b/migration/savevm-async.c
|
2017-04-05 11:49:19 +03:00
|
|
|
new file mode 100644
|
2024-04-25 18:21:28 +03:00
|
|
|
index 0000000000..8f63c4c637
|
2017-04-05 11:49:19 +03:00
|
|
|
--- /dev/null
|
2021-02-11 19:11:11 +03:00
|
|
|
+++ b/migration/savevm-async.c
|
2024-04-25 18:21:28 +03:00
|
|
|
@@ -0,0 +1,534 @@
|
2017-04-05 11:49:19 +03:00
|
|
|
+#include "qemu/osdep.h"
|
update submodule and patches to 7.1.0
Notable changes:
* The only big change is the switch to using a custom QIOChannel for
savevm-async, because the previously used QEMUFileOps was dropped.
Changes to the current implementation:
* Switch to vector based methods as required for an IO channel. For
short reads the passed-in IO vector is stuffed with zeroes at the
end, just to be sure.
* For reading: The documentation in include/io/channel.h states that
at least one byte should be read, so also error out when we are
at the very end instead of returning 0.
* For reading: Fix off-by-one error when request goes beyond end.
The wrong code piece was:
if ((pos + size) > maxlen) {
size = maxlen - pos - 1;
}
Previously, the last byte would not be read. It's actually
possible to get a snapshot .raw file that has content all the way
up to the final 512-byte (= BDRV_SECTOR_SIZE) boundary without any
trailing zero bytes (I wrote a script to do it).
Luckily, it didn't cause a real issue, because qemu_loadvm_state()
is not interested in the final (i.e. QEMU_VM_VMDESCRIPTION)
section. The buffer for reading it is simply freed up afterwards
and the function will assume that it read the whole section, even
if that's not the case.
* For writing: Make use of the generated blk_pwritev() wrapper
instead of manually wrapping the coroutine to simplify and save a
few lines.
* Adapt to changed interfaces for blk_{pread,pwrite}:
* a9262f551e ("block: Change blk_{pread,pwrite}() param order")
* 3b35d4542c ("block: Add a 'flags' param to blk_pread()")
* bf5b16fa40 ("block: Make blk_{pread,pwrite}() return 0 on success")
Those changes especially affected the qemu-img dd patches, because
the context also changed, but also some of our block drivers used
the functions.
* Drop qemu-common.h include: it got renamed after essentially
everything was moved to other headers. The only remaining user I
could find for things dropped from the header between 7.0 and 7.1
was qemu_get_vm_name() in the iscsi-initiatorname patch, but it
already includes the header to which the function was moved.
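The off-by-one fix for reading described above can be illustrated with
a minimal sketch. The helper name clamp_read_size() and its signature
are hypothetical and not taken from the patch; the sketch only assumes
that maxlen is the number of readable bytes and that a result of 0 is
turned into an error by the caller, since an IO channel read is
expected to return at least one byte.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): clamp a read request of
 * `size` bytes starting at offset `pos` so that it does not run past
 * `maxlen`, the total number of readable bytes.
 *
 * Returns 0 when `pos` is already at or beyond the end; the caller is
 * assumed to report that case as an error.
 */
static size_t clamp_read_size(int64_t pos, size_t size, int64_t maxlen)
{
    if (pos < 0 || pos >= maxlen) {
        return 0;
    }

    uint64_t remaining = (uint64_t)(maxlen - pos);

    if (size > remaining) {
        /* The old code used maxlen - pos - 1 and lost the last byte. */
        size = (size_t)remaining;
    }
    return size;
}
```

With maxlen = 512, a request for 100 bytes at offset 500 is clamped to
12 bytes, so the byte at offset 511 is included in the read.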
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-10-14 15:07:13 +03:00
|
|
|
+#include "migration/channel-savevm-async.h"
|
2018-02-22 14:34:57 +03:00
|
|
|
+#include "migration/migration.h"
|
2023-10-17 15:10:09 +03:00
|
|
|
+#include "migration/migration-stats.h"
|
|
|
|
+#include "migration/options.h"
|
2018-02-22 14:34:57 +03:00
|
|
|
+#include "migration/savevm.h"
|
|
|
|
+#include "migration/snapshot.h"
|
|
|
|
+#include "migration/global_state.h"
|
|
|
|
+#include "migration/ram.h"
|
|
|
|
+#include "migration/qemu-file.h"
|
2017-04-05 11:49:19 +03:00
|
|
|
+#include "sysemu/sysemu.h"
|
2020-03-10 17:12:50 +03:00
|
|
|
+#include "sysemu/runstate.h"
|
2017-04-05 11:49:19 +03:00
|
|
|
+#include "block/block.h"
|
|
|
|
+#include "sysemu/block-backend.h"
|
2018-08-30 16:00:07 +03:00
|
|
|
+#include "qapi/error.h"
|
|
|
|
+#include "qapi/qmp/qerror.h"
|
|
|
|
+#include "qapi/qmp/qdict.h"
|
|
|
|
+#include "qapi/qapi-commands-migration.h"
|
|
|
|
+#include "qapi/qapi-commands-misc.h"
|
2019-04-19 10:53:37 +03:00
|
|
|
+#include "qapi/qapi-commands-block.h"
|
2017-04-05 11:49:19 +03:00
|
|
|
+#include "qemu/cutils.h"
|
2021-02-11 19:11:11 +03:00
|
|
|
+#include "qemu/timer.h"
|
2020-03-10 17:12:50 +03:00
|
|
|
+#include "qemu/main-loop.h"
|
|
|
|
+#include "qemu/rcu.h"
|
squash related patches
where there is no good reason to keep them separate. It's a pain
during rebase if there are multiple patches changing the same code
over and over again. This was especially bad for the backup-related
patches. If the history of patches really is needed, it can be
extracted via git. Additionally, compilation with partial application
of patches had been broken for a long time, because one of the master key
changes became part of an earlier patch during a past rebase.
If only the same files were changed by a subsequent patch and the
changes felt to belong together (obvious for later bug fixes, but also
done for features e.g. adding master key support for PBS), the patches
were squashed together.
The PBS namespace support patch was split into the individual parts
it changes, i.e. PBS block driver, pbs-restore binary and QMP backup
infrastructure, and squashed into the respective patches.
No code change is intended, git diff in the submodule should not show
any difference between applying all patches before this commit and
applying all patches after this commit.
The query-proxmox-support QMP function has been left as part of the
"PVE-Backup: Proxmox backup patches for QEMU" patch, because it's
currently only used there. If it ever is used elsewhere too, it can
be split out from there.
The recent alloc-track and BQL-related savevm-async changes have been
left separate for now, because it's not 100% clear they are the best
approach yet. This depends on what upstream decides about the BQL
stuff and whether and what kind of issues with the changes pop up.
The qemu-img dd snapshot patch has been re-ordered to after the other
qemu-img dd patches.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-05-15 16:39:56 +03:00
|
|
|
+#include "qemu/yank.h"
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
|
|
|
+/* #define DEBUG_SAVEVM_STATE */
|
|
|
|
+
|
|
|
|
+#ifdef DEBUG_SAVEVM_STATE
|
|
|
|
+#define DPRINTF(fmt, ...) \
|
|
|
|
+ do { printf("savevm-async: " fmt, ## __VA_ARGS__); } while (0)
|
|
|
|
+#else
|
|
|
|
+#define DPRINTF(fmt, ...) \
|
|
|
|
+ do { } while (0)
|
|
|
|
+#endif
|
|
|
|
+
|
|
|
|
+enum {
|
|
|
|
+ SAVE_STATE_DONE,
|
|
|
|
+ SAVE_STATE_ERROR,
|
|
|
|
+ SAVE_STATE_ACTIVE,
|
|
|
|
+ SAVE_STATE_COMPLETED,
|
|
|
|
+ SAVE_STATE_CANCELLED
|
|
|
|
+};
|
|
|
|
+
|
|
|
|
+
|
|
|
|
+static struct SnapshotState {
|
2017-08-07 10:10:07 +03:00
|
|
|
+ BlockBackend *target;
|
2017-04-05 11:49:19 +03:00
|
|
|
+ size_t bs_pos;
|
|
|
|
+ int state;
|
|
|
|
+ Error *error;
|
|
|
|
+ Error *blocker;
|
|
|
|
+ int saved_vm_running;
|
|
|
|
+ QEMUFile *file;
|
|
|
|
+ int64_t total_time;
|
2020-07-02 14:07:28 +03:00
|
|
|
+ QEMUBH *finalize_bh;
|
|
|
|
+ Coroutine *co;
|
2022-08-18 14:44:16 +03:00
|
|
|
+ QemuCoSleep target_close_wait;
|
2017-04-05 11:49:19 +03:00
|
|
|
+} snap_state;
|
|
|
|
+
|
2021-02-11 19:11:11 +03:00
|
|
|
+static bool savevm_aborted(void)
|
|
|
|
+{
|
|
|
|
+ return snap_state.state == SAVE_STATE_CANCELLED ||
|
|
|
|
+ snap_state.state == SAVE_STATE_ERROR;
|
|
|
|
+}
|
|
|
|
+
|
2017-04-05 11:49:19 +03:00
|
|
|
+SaveVMInfo *qmp_query_savevm(Error **errp)
|
|
|
|
+{
|
|
|
|
+ SaveVMInfo *info = g_malloc0(sizeof(*info));
|
|
|
|
+ struct SnapshotState *s = &snap_state;
|
|
|
|
+
|
|
|
|
+ if (s->state != SAVE_STATE_DONE) {
|
|
|
|
+ info->has_bytes = true;
|
|
|
|
+ info->bytes = s->bs_pos;
|
|
|
|
+ switch (s->state) {
|
|
|
|
+ case SAVE_STATE_ERROR:
|
|
|
|
+ info->status = g_strdup("failed");
|
|
|
|
+ info->has_total_time = true;
|
|
|
|
+ info->total_time = s->total_time;
|
|
|
|
+ if (s->error) {
|
|
|
|
+ info->error = g_strdup(error_get_pretty(s->error));
|
|
|
|
+ }
|
|
|
|
+ break;
|
|
|
|
+ case SAVE_STATE_ACTIVE:
|
|
|
|
+ info->status = g_strdup("active");
|
|
|
|
+ info->has_total_time = true;
|
|
|
|
+ info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME)
|
|
|
|
+ - s->total_time;
|
|
|
|
+ break;
|
|
|
|
+ case SAVE_STATE_COMPLETED:
|
|
|
|
+ info->status = g_strdup("completed");
|
|
|
|
+ info->has_total_time = true;
|
|
|
|
+ info->total_time = s->total_time;
|
|
|
|
+ break;
|
|
|
|
+ }
|
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ return info;
|
|
|
|
+}
|
|
|
|
+
|
|
|
|
+static int save_snapshot_cleanup(void)
|
|
|
|
+{
|
|
|
|
+ int ret = 0;
|
|
|
|
+
|
|
|
|
+ DPRINTF("save_snapshot_cleanup\n");
|
|
|
|
+
|
|
|
|
+ snap_state.total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) -
|
|
|
|
+ snap_state.total_time;
|
|
|
|
+
|
|
|
|
+ if (snap_state.file) {
|
|
|
|
+ ret = qemu_fclose(snap_state.file);
|
2022-10-14 15:07:13 +03:00
|
|
|
+ snap_state.file = NULL;
|
2017-04-05 11:49:19 +03:00
|
|
|
+ }
|
|
|
|
+
|
2017-08-07 10:10:07 +03:00
|
|
|
+ if (snap_state.target) {
|
2021-02-11 19:11:11 +03:00
|
|
|
+ if (!savevm_aborted()) {
|
|
|
|
+ /* try to truncate, but ignore errors (will fail on block devices).
|
|
|
|
+ * note1: bdrv_read() needs whole blocks, so we need to round up
|
|
|
|
+ * note2: PVE requires 1024 (BDRV_SECTOR_SIZE*2) alignment
|
|
|
|
+ */
|
|
|
|
+ size_t size = QEMU_ALIGN_UP(snap_state.bs_pos, BDRV_SECTOR_SIZE*2);
|
|
|
|
+ blk_truncate(snap_state.target, size, false, PREALLOC_MODE_OFF, 0, NULL);
|
|
|
|
+ }
|
2017-08-07 10:10:07 +03:00
|
|
|
+ blk_op_unblock_all(snap_state.target, snap_state.blocker);
|
2017-04-05 11:49:19 +03:00
|
|
|
+ error_free(snap_state.blocker);
|
|
|
|
+ snap_state.blocker = NULL;
|
2017-08-07 10:10:07 +03:00
|
|
|
+ blk_unref(snap_state.target);
|
|
|
|
+ snap_state.target = NULL;
|
2021-02-11 19:11:11 +03:00
|
|
|
+
|
2022-08-18 14:44:16 +03:00
|
|
|
+ qemu_co_sleep_wake(&snap_state.target_close_wait);
|
2017-04-05 11:49:19 +03:00
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ return ret;
|
|
|
|
+}
|
|
|
|
+
|
2023-08-07 16:19:42 +03:00
|
|
|
+static void G_GNUC_PRINTF(1, 2) save_snapshot_error(const char *fmt, ...)
|
2017-04-05 11:49:19 +03:00
|
|
|
+{
|
|
|
|
+ va_list ap;
|
|
|
|
+ char *msg;
|
|
|
|
+
|
|
|
|
+ va_start(ap, fmt);
|
|
|
|
+ msg = g_strdup_vprintf(fmt, ap);
|
|
|
|
+ va_end(ap);
|
|
|
|
+
|
|
|
|
+ DPRINTF("save_snapshot_error: %s\n", msg);
|
|
|
|
+
|
|
|
|
+ if (!snap_state.error) {
|
|
|
|
+ error_set(&snap_state.error, ERROR_CLASS_GENERIC_ERROR, "%s", msg);
|
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ g_free(msg);
|
|
|
|
+
|
|
|
|
+ snap_state.state = SAVE_STATE_ERROR;
|
|
|
|
+}
|
|
|
|
+
|
2020-07-02 14:07:28 +03:00
|
|
|
+static void process_savevm_finalize(void *opaque)
|
2019-04-19 10:53:37 +03:00
|
|
|
+{
|
|
|
|
+ int ret;
|
2020-07-02 14:07:28 +03:00
|
|
|
+ AioContext *iohandler_ctx = iohandler_get_aio_context();
|
|
|
|
+ MigrationState *ms = migrate_get_current();
|
|
|
|
+
|
2021-02-11 19:11:11 +03:00
|
|
|
+ bool aborted = savevm_aborted();
|
|
|
|
+
|
2020-07-02 14:07:28 +03:00
|
|
|
+#ifdef DEBUG_SAVEVM_STATE
|
|
|
|
+ int64_t start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
|
|
|
|
+#endif
|
|
|
|
+
|
|
|
|
+ qemu_bh_delete(snap_state.finalize_bh);
|
|
|
|
+ snap_state.finalize_bh = NULL;
|
|
|
|
+ snap_state.co = NULL;
|
|
|
|
+
|
|
|
|
+ /* We need to own the target bdrv's context for the following functions,
|
|
|
|
+ * so move it back. It can stay in the main context and live out its life
|
|
|
|
+ * there, since we're done with it after this method ends anyway.
|
|
|
|
+ */
|
|
|
|
+ aio_context_acquire(iohandler_ctx);
|
|
|
|
+ blk_set_aio_context(snap_state.target, qemu_get_aio_context(), NULL);
|
|
|
|
+ aio_context_release(iohandler_ctx);
|
|
|
|
+
|
|
|
|
+ ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
|
|
|
|
+ if (ret < 0) {
|
|
|
|
+ save_snapshot_error("vm_stop_force_state error %d", ret);
|
|
|
|
+ }
|
|
|
|
+
|
2021-02-11 19:11:11 +03:00
|
|
|
+ if (!aborted) {
|
|
|
|
+ /* skip state saving if we aborted, snapshot will be invalid anyway */
|
|
|
|
+ (void)qemu_savevm_state_complete_precopy(snap_state.file, false, false);
|
|
|
|
+ ret = qemu_file_get_error(snap_state.file);
|
|
|
|
+ if (ret < 0) {
|
2023-01-23 14:43:23 +03:00
|
|
|
+ save_snapshot_error("qemu_savevm_state_complete_precopy error %d", ret);
|
2021-02-11 19:11:11 +03:00
|
|
|
+ }
|
2020-07-02 14:07:28 +03:00
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ DPRINTF("state saving complete\n");
|
|
|
|
+ DPRINTF("timing: process_savevm_finalize (state saving) took %ld ms\n",
|
|
|
|
+ qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - start_time);
|
|
|
|
+
|
|
|
|
+ /* clear migration state */
|
|
|
|
+ migrate_set_state(&ms->state, MIGRATION_STATUS_SETUP,
|
2021-02-11 19:11:11 +03:00
|
|
|
+ ret || aborted ? MIGRATION_STATUS_FAILED : MIGRATION_STATUS_COMPLETED);
|
2020-07-02 14:07:28 +03:00
|
|
|
+ ms->to_dst_file = NULL;
|
|
|
|
+
|
|
|
|
+ qemu_savevm_state_cleanup();
|
|
|
|
+
|
2019-04-19 10:53:37 +03:00
|
|
|
+ ret = save_snapshot_cleanup();
|
|
|
|
+ if (ret < 0) {
|
|
|
|
+ save_snapshot_error("save_snapshot_cleanup error %d", ret);
|
|
|
|
+ } else if (snap_state.state == SAVE_STATE_ACTIVE) {
|
|
|
|
+ snap_state.state = SAVE_STATE_COMPLETED;
|
2021-02-11 19:11:11 +03:00
|
|
|
+ } else if (aborted) {
|
2022-08-18 14:44:17 +03:00
|
|
|
+ /*
|
|
|
|
+ * If there was an error, there's no need to set a new one here.
|
|
|
|
+ * If the snapshot was canceled, leave setting the state to
|
|
|
|
+ * qmp_savevm_end(), which is woken by save_snapshot_cleanup().
|
|
|
|
+ */
|
2019-04-19 10:53:37 +03:00
|
|
|
+ } else {
|
|
|
|
+ save_snapshot_error("process_savevm_finalize: invalid state: %d",
|
|
|
|
+ snap_state.state);
|
2017-04-05 11:49:19 +03:00
|
|
|
+ }
|
2019-04-19 10:53:37 +03:00
|
|
|
+ if (snap_state.saved_vm_running) {
|
|
|
|
+ vm_start();
|
|
|
|
+ snap_state.saved_vm_running = false;
|
2017-04-05 11:49:19 +03:00
|
|
|
+ }
|
2020-07-02 14:07:28 +03:00
|
|
|
+
|
|
|
|
+ DPRINTF("timing: process_savevm_finalize (full) took %ld ms\n",
|
|
|
|
+ qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - start_time);
|
2017-04-05 11:49:19 +03:00
|
|
|
+}
|
|
|
|
+
|
2020-07-02 14:07:28 +03:00
|
|
|
+static void coroutine_fn process_savevm_co(void *opaque)
|
2017-04-05 11:49:19 +03:00
|
|
|
+{
|
|
|
|
+ int ret;
|
|
|
|
+ int64_t maxlen;
|
2020-07-02 14:07:28 +03:00
|
|
|
+ BdrvNextIterator it;
|
|
|
|
+ BlockDriverState *bs = NULL;
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
2020-07-02 14:07:28 +03:00
|
|
|
+#ifdef DEBUG_SAVEVM_STATE
|
|
|
|
+ int64_t start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
|
|
|
|
+#endif
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
2018-02-22 14:34:57 +03:00
|
|
|
+ ret = qemu_file_get_error(snap_state.file);
|
2017-04-05 11:49:19 +03:00
|
|
|
+ if (ret < 0) {
|
2018-02-22 14:34:57 +03:00
|
|
|
+ save_snapshot_error("qemu_savevm_state_setup failed");
|
2020-07-02 14:07:28 +03:00
|
|
|
+ return;
|
2017-04-05 11:49:19 +03:00
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ while (snap_state.state == SAVE_STATE_ACTIVE) {
|
update submodule and patches to QEMU 8.0.0
Many changes were necessary this time around:
* QAPI was changed to avoid redundant has_* variables, see commit
44ea9d9be3 ("qapi: Start to elide redundant has_FOO in generated C")
for details. This affected many QMP commands added by Proxmox too.
* Pending querying for migration got split into two functions, one to
estimate, one for exact value, see commit c8df4a7aef ("migration:
Split save_live_pending() into state_pending_*") for details. Relevant
for savevm-async and PBS dirty bitmap.
* Some block (driver) functions got converted to coroutines, so the
Proxmox block drivers needed to be adapted.
* Alloc track auto-detaching during PBS live restore got broken by
AioContext-related changes resulting in a deadlock. The current, hacky
method was replaced by a simpler one. Stefan apparently ran into a
problem with that when he wrote the driver, but there were
improvements in the stream job code since then and I didn't manage to
reproduce the issue. It's a separate patch "alloc-track: fix deadlock
during drop" for now, you can find the details there.
* Async snapshot-related changes:
- The pending querying got adapted to the above-mentioned split and
a patch is added to optimize it/make it more similar to what
upstream code does.
- Added initialization of the compression counters (for
future-proofing).
- It's necessary to hold the BQL (big QEMU lock = iothread mutex)
during the setup phase, because block layer functions are used there
and not doing so leads to racy, hard-to-debug crashes or hangs. It's
necessary to change some upstream code too for this; a version of
the patch "migration: for snapshots, hold the BQL during setup
callbacks" is intended to be upstreamed.
- Need to take the bdrv graph read lock before flushing.
* hmp_info_balloon was moved to a different file.
* Needed to include new headers from time to time to still get the
correct functions.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-05-15 16:39:53 +03:00
|
|
|
+ uint64_t pending_size, pend_precopy, pend_postcopy;
|
2023-05-15 16:39:56 +03:00
|
|
|
+ uint64_t threshold = 400 * 1000;
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
2023-05-15 16:39:56 +03:00
|
|
|
+ /*
|
|
|
|
+ * pending_{estimate,exact} are expected to be called without iothread
|
|
|
|
+ * lock. Similar to what is done in migration.c, call the exact variant
|
|
|
|
+ * only once pend_precopy in the estimate is below the threshold.
|
|
|
|
+ */
|
2021-03-16 19:30:22 +03:00
|
|
|
+ qemu_mutex_unlock_iothread();
|
2023-05-15 16:39:56 +03:00
|
|
|
+ qemu_savevm_state_pending_estimate(&pend_precopy, &pend_postcopy);
|
|
|
|
+ if (pend_precopy <= threshold) {
|
|
|
|
+ qemu_savevm_state_pending_exact(&pend_precopy, &pend_postcopy);
|
|
|
|
+ }
|
2021-03-16 19:30:22 +03:00
|
|
|
+ qemu_mutex_lock_iothread();
|
2023-05-15 16:39:53 +03:00
|
|
|
+ pending_size = pend_precopy + pend_postcopy;
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
savevm-async: keep more free space when entering final stage
In qemu-server, we already allocate 2 * $mem_size + 500 MiB for driver
state (which was 32 MiB long ago according to git history). It seems
likely that the 30 MiB cutoff in the savevm-async implementation was
chosen based on that.
In bug #4476 [0], another issue caused the iteration to not make any
progress and the state file filled up all the way to the 30 MiB +
pending_size cutoff. Since the guest is not stopped immediately after
the check, it can still dirty some RAM and the current cutoff is not
enough for a reproducer VM (was done while bug #4476 still was not
fixed), dirtying memory with
> stress-ng -B 2 --bigheap-growth 64.0M
After entering the final stage, savevm actually filled up the state
file completely, leading to an I/O error. It's probably the same
scenario as reported in the bug report, the error message was fixed in
commit a020815 ("savevm-async: fix function name in error message")
after the bug report.
If not for the bug, the cutoff will only be reached by a VM that's
dirtying RAM faster than can be written to the storage, so increase
the cutoff to 100 MiB to have a bigger chance to finish successfully,
while still trying to not increase downtime too much for
non-hibernation snapshots.
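The cutoff check described above can be sketched as a standalone
function. This is only an illustration under stated assumptions: the
name should_keep_iterating() is hypothetical, the 400 * 1000 byte
threshold and the 100 MiB reserve are copied from the loop in
process_savevm_co(), and the explicit handling of a target smaller
than the reserve is an addition that the patch itself does not have.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): decide whether the
 * iteration phase should continue or the final stage should start.
 *
 *   pend_precopy - bytes still pending for precopy
 *   pending_size - pend_precopy + pend_postcopy
 *   bs_pos       - bytes already written to the state file
 *   target_len   - size of the state file in bytes
 */
static bool should_keep_iterating(uint64_t pend_precopy,
                                  uint64_t pending_size,
                                  uint64_t bs_pos,
                                  int64_t target_len)
{
    const uint64_t threshold = 400 * 1000;
    const int64_t reserve = 100 * 1024 * 1024; /* 100 MiB kept free */
    int64_t maxlen = target_len - reserve;

    if (maxlen < 0) {
        /* Assumption: a target smaller than the reserve never iterates. */
        return false;
    }

    return pend_precopy > threshold &&
           bs_pos + pending_size < (uint64_t)maxlen;
}
```

For a 1 GiB state file, the final stage is entered as soon as the
bytes already written plus the pending bytes reach 1 GiB minus
100 MiB, or once the pending precopy data drops to the threshold.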
[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=4476
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-01-26 16:46:14 +03:00
|
|
|
+ /*
|
|
|
|
+ * A guest reaching this cutoff is dirtying lots of RAM. It should be
|
|
|
|
+ * large enough so that the guest can't dirty this much between the
|
|
|
|
+ * check and the guest actually being stopped, but it should be small
|
|
|
|
+ * enough to avoid long downtimes for non-hibernation snapshots.
|
|
|
|
+ */
|
|
|
|
+ maxlen = blk_getlength(snap_state.target) - 100*1024*1024;
|
2019-04-19 10:53:37 +03:00
|
|
|
+
|
2023-01-26 16:46:13 +03:00
|
|
|
+ /* Note that there is no progress for pend_postcopy when iterating */
|
2023-05-15 16:39:56 +03:00
|
|
|
+ if (pend_precopy > threshold && snap_state.bs_pos + pending_size < maxlen) {
|
2019-04-19 10:53:37 +03:00
|
|
|
+ ret = qemu_savevm_state_iterate(snap_state.file, false);
|
|
|
|
+ if (ret < 0) {
|
|
|
|
+ save_snapshot_error("qemu_savevm_state_iterate error %d", ret);
|
|
|
|
+ break;
|
|
|
|
+ }
|
2020-07-02 14:07:28 +03:00
|
|
|
+ DPRINTF("savevm iterate pending size %lu ret %d\n", pending_size, ret);
|
2017-04-05 11:49:19 +03:00
|
|
|
+ } else {
|
2019-06-06 13:58:15 +03:00
|
|
|
+ qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
|
2023-10-17 15:10:09 +03:00
|
|
|
+ global_state_store();
|
2020-07-02 14:07:28 +03:00
|
|
|
+
|
|
|
|
+ DPRINTF("savevm iterate complete\n");
|
2017-04-05 11:49:19 +03:00
|
|
|
+ break;
|
|
|
|
+ }
|
|
|
|
+ }
|
|
|
|
+
|
2020-07-02 14:07:28 +03:00
|
|
|
+ DPRINTF("timing: process_savevm_co took %ld ms\n",
|
|
|
|
+ qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - start_time);
|
|
|
|
+
|
|
|
|
+#ifdef DEBUG_SAVEVM_STATE
|
|
|
|
+ int64_t start_time_flush = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
|
|
|
|
+#endif
|
|
|
|
+ /* If a drive runs in an IOThread we can flush it async, and only
|
|
|
|
+ * need to sync-flush whatever IO happens between now and
|
|
|
|
+ * vm_stop_force_state. bdrv_next can only be called from main AioContext,
|
|
|
|
+ * so move there now and after every flush.
|
|
|
|
+ */
|
|
|
|
+ aio_co_reschedule_self(qemu_get_aio_context());
|
update submodule and patches to QEMU 8.2.2
This version includes both the AioContext lock and the block graph
lock, so there might be some deadlocks lurking. It's not possible to
disable the block graph lock like was done in QEMU 8.1, because there
are now changes like the function bdrv_schedule_unref() that require
it. QEMU 9.0 will finally get rid of the AioContext locking.
During live-restore with a VirtIO SCSI drive with iothread there is a
known racy deadlock related to the AioContext lock. Not new [1], but
not sure if more likely now. Should be fixed in QEMU 9.0.
The block graph lock comes with annotations that can be checked by
clang's TSA. This required changes to the block drivers, i.e.
alloc-track, pbs, zeroinit as well as taking the appropriate locks
in pve-backup, savevm-async, vma-reader.
Local variable shadowing is prohibited via a compiler flag now,
required slight adaptation in vma.c.
Major changes only affect alloc-track:
* It is not possible to call a generated co-wrapper like
bdrv_get_info() while holding the block graph lock exclusively [0],
which does happen during initialization of alloc-track when the
backing hd is set and the refresh_limits driver callback is invoked.
The bdrv_get_info() call to get the cluster size is moved to
directly after opening the file child in track_open().
The important thing is that at least the request alignment for the
write target is used, because then the RMW cycle in bdrv_pwritev
will gather enough data from the backing file. Partial cluster
allocations in the target are not a fundamental issue, because the
driver returns its allocation status based on the bitmap, so any
other data that maps to the same cluster will still be copied later
by a stream job (or during writes to that cluster).
* Replacing the node cannot be done in the
track_co_change_backing_file() callback, because it is a coroutine
and cannot hold the block graph lock exclusively. So it is moved to
the stream job itself with the auto-remove option not having an
effect anymore (qemu-server would always set it anyways).
In the future, there could either be a special option for the stream
job, or maybe the upcoming blockdev-replace QMP command can be used.
Replacing the backing child is actually already done in the stream
job, so no need to do it in the track_co_change_backing_file()
callback. It also cannot be called from a coroutine. Looking at the
implementation in the qcow2 driver, it doesn't seem to be intended
to change the backing child itself, just update driver-internal
state.
Other changes:
* alloc-track: Error out early when used without auto-remove. Since
replacing the node now happens in the stream job, where the option
cannot be read from (it's internal to the driver), it will always be
treated as 'on'. Makes sure to have users besides qemu-server notice
the change (should they even exist). The option can be fully dropped
in the future while adding a version guard in qemu-server.
* alloc-track: Avoid seemingly superfluous child permission update.
Doesn't seem necessary nowadays (maybe after commit "alloc-track:
fix deadlock during drop" where the dropping is not rescheduled and
delayed anymore or some upstream change). Replacing the block node
will already update the permissions of the new node (which was the
file child before). Should there really be some issue, instead of
having a drop state, this could also be just based off the fact
whether there is still a backing child.
Dumping the cumulative (shared) permissions for the BDS with a debug
print yields the same values after this patch and with QEMU 8.1,
namely 3 and 5.
* PBS block driver: compile unconditionally. Proxmox VE always needs
it and something in the build process changed to make it not enabled
by default. Probably would need to move the build option to meson
otherwise.
* backup: job unreferencing during cleanup needs to happen outside of
coroutine, so it was moved to before invoking the clean
* mirror: Cherry-pick stable fix to avoid potential deadlock.
* savevm-async: migrate_init now can fail, so propagate potential
error.
* savevm-async: compression counters are not accessible outside
migration/ram-compress now, so drop code that prophylactically set
them to zero.
[0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
|
|
|
+ bdrv_graph_co_rdlock();
|
|
|
|
+ bs = bdrv_first(&it);
|
|
|
|
+ bdrv_graph_co_rdunlock();
|
|
|
|
+ while (bs) {
|
2020-07-02 14:07:28 +03:00
|
|
|
+ /* target has BDRV_O_NO_FLUSH, no sense calling bdrv_flush on it */
|
update submodule and patches to QEMU 8.2.2
2024-04-25 18:21:28 +03:00
|
|
|
+ if (bs != blk_bs(snap_state.target)) {
|
|
|
|
+ AioContext *bs_ctx = bdrv_get_aio_context(bs);
|
|
|
|
+ if (bs_ctx != qemu_get_aio_context()) {
|
|
|
|
+ DPRINTF("savevm: async flushing drive %s\n", bs->filename);
|
|
|
|
+ aio_co_reschedule_self(bs_ctx);
|
|
|
|
+ bdrv_graph_co_rdlock();
|
|
|
|
+ bdrv_flush(bs);
|
|
|
|
+ bdrv_graph_co_rdunlock();
|
|
|
|
+ aio_co_reschedule_self(qemu_get_aio_context());
|
|
|
|
+ }
|
2020-07-02 14:07:28 +03:00
|
|
|
+ }
|
update submodule and patches to QEMU 8.2.2
2024-04-25 18:21:28 +03:00
|
|
|
+ bdrv_graph_co_rdlock();
|
|
|
|
+ bs = bdrv_next(&it);
|
|
|
|
+ bdrv_graph_co_rdunlock();
|
2020-07-02 14:07:28 +03:00
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ DPRINTF("timing: async flushing took %ld ms\n",
|
|
|
|
+ qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - start_time_flush);
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
2020-07-02 14:07:28 +03:00
|
|
|
+ qemu_bh_schedule(snap_state.finalize_bh);
|
2017-04-05 11:49:19 +03:00
|
|
|
+}
|
|
|
|
+
|
update submodule and patches to QEMU 8.0.0
Many changes were necessary this time around:
* QAPI was changed to avoid redundant has_* variables, see commit
44ea9d9be3 ("qapi: Start to elide redundant has_FOO in generated C")
for details. This affected many QMP commands added by Proxmox too.
* Pending querying for migration got split into two functions, one to
estimate, one for exact value, see commit c8df4a7aef ("migration:
Split save_live_pending() into state_pending_*") for details. Relevant
for savevm-async and PBS dirty bitmap.
* Some block (driver) functions got converted to coroutines, so the
Proxmox block drivers needed to be adapted.
* Alloc track auto-detaching during PBS live restore got broken by
AioContext-related changes resulting in a deadlock. The current, hacky
method was replaced by a simpler one. Stefan apparently ran into a
problem with that when he wrote the driver, but there were
improvements in the stream job code since then and I didn't manage to
reproduce the issue. It's a separate patch "alloc-track: fix deadlock
during drop" for now, you can find the details there.
* Async snapshot-related changes:
- The pending querying got adapted to the above-mentioned split and
a patch is added to optimize it/make it more similar to what
upstream code does.
- Added initialization of the compression counters (for
future-proofing).
- It's necessary to hold the BQL (big QEMU lock = iothread mutex)
during the setup phase, because block layer functions are used there
and not doing so leads to racy, hard-to-debug crashes or hangs. It's
necessary to change some upstream code too for this, a version of
the patch "migration: for snapshots, hold the BQL during setup
callbacks" is intended to be upstreamed.
- Need to take the bdrv graph read lock before flushing.
* hmp_info_balloon was moved to a different file.
* Needed to include new headers from time to time to still get the
correct functions.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
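The estimate/exact split mentioned in the notes above can be sketched as follows. This is a toy model under stated assumptions, not the code from the patch: all type and function names are hypothetical stand-ins, and the assumed pattern is that the cheap estimate is queried first and the expensive exact value only once the estimate drops below the threshold.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a pending-data source; every name here is hypothetical. */
typedef struct {
    uint64_t estimated_pending; /* cheap, possibly stale value */
    uint64_t exact_pending;     /* expensive, up-to-date value */
    int exact_calls;            /* how often the expensive query ran */
} ToyPendingSource;

/* Cheap query: returns the cached estimate without further work. */
static uint64_t toy_pending_estimate(const ToyPendingSource *s)
{
    return s->estimated_pending;
}

/* Expensive query: counts its invocations so the caller can be checked. */
static uint64_t toy_pending_exact(ToyPendingSource *s)
{
    s->exact_calls++;
    return s->exact_pending;
}

/*
 * Ask for the estimate first and only pay for the exact value when the
 * estimate suggests that the remaining data is below the threshold.
 */
static uint64_t toy_query_pending(ToyPendingSource *s, uint64_t threshold)
{
    uint64_t pending = toy_pending_estimate(s);

    if (pending < threshold) {
        pending = toy_pending_exact(s);
    }
    return pending;
}
```

While the estimate stays above the threshold, the expensive query is never issued. The decision to finish iterating is then based on the exact value only.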
2023-05-15 16:39:53 +03:00
|
|
|
+void qmp_savevm_start(const char *statefile, Error **errp)
|
2017-04-05 11:49:19 +03:00
|
|
|
+{
|
|
|
|
+ Error *local_err = NULL;
|
2020-07-02 14:07:28 +03:00
|
|
|
+ MigrationState *ms = migrate_get_current();
|
|
|
|
+ AioContext *iohandler_ctx = iohandler_get_aio_context();
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
2017-08-07 10:10:07 +03:00
|
|
|
+ int bdrv_oflags = BDRV_O_RDWR | BDRV_O_RESIZE | BDRV_O_NO_FLUSH;
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
|
|
|
+ if (snap_state.state != SAVE_STATE_DONE) {
|
|
|
|
+ error_set(errp, ERROR_CLASS_GENERIC_ERROR,
|
|
|
|
+ "VM snapshot already started");
|
|
|
|
+ return;
|
|
|
|
+ }
|
|
|
|
+
|
2020-07-02 14:07:28 +03:00
|
|
|
+ if (migration_is_running(ms->state)) {
|
|
|
|
+ error_set(errp, ERROR_CLASS_GENERIC_ERROR, QERR_MIGRATION_ACTIVE);
|
|
|
|
+ return;
|
|
|
|
+ }
|
|
|
|
+
|
2023-10-17 15:10:09 +03:00
|
|
|
+ if (migrate_block()) {
|
2020-07-02 14:07:28 +03:00
|
|
|
+ error_set(errp, ERROR_CLASS_GENERIC_ERROR,
|
|
|
|
+ "Block migration and snapshots are incompatible");
|
|
|
|
+ return;
|
|
|
|
+ }
|
|
|
|
+
|
2017-04-05 11:49:19 +03:00
|
|
|
+ /* initialize snapshot info */
|
|
|
|
+ snap_state.saved_vm_running = runstate_is_running();
|
|
|
|
+ snap_state.bs_pos = 0;
|
|
|
|
+ snap_state.total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
|
|
|
|
+ snap_state.blocker = NULL;
|
2022-10-14 15:07:15 +03:00
|
|
|
+ snap_state.target_close_wait = (QemuCoSleep){ .to_wake = NULL };
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
|
|
|
+ if (snap_state.error) {
|
|
|
|
+ error_free(snap_state.error);
|
|
|
|
+ snap_state.error = NULL;
|
|
|
|
+ }
|
|
|
|
+
|
update submodule and patches to QEMU 8.0.0
2023-05-15 16:39:53 +03:00
|
|
|
+ if (!statefile) {
|
2017-04-05 11:49:19 +03:00
|
|
|
+ vm_stop(RUN_STATE_SAVE_VM);
|
|
|
|
+ snap_state.state = SAVE_STATE_COMPLETED;
|
|
|
|
+ return;
|
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ if (qemu_savevm_state_blocked(errp)) {
|
|
|
|
+ return;
|
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ /* Open the image */
|
|
|
|
+ QDict *options = NULL;
|
|
|
|
+ options = qdict_new();
|
2018-08-30 16:00:07 +03:00
|
|
|
+ qdict_put_str(options, "driver", "raw");
|
2017-08-07 10:10:07 +03:00
|
|
|
+ snap_state.target = blk_new_open(statefile, NULL, options, bdrv_oflags, &local_err);
|
|
|
|
+ if (!snap_state.target) {
|
2017-04-05 11:49:19 +03:00
|
|
|
+ error_set(errp, ERROR_CLASS_GENERIC_ERROR, "failed to open '%s'", statefile);
|
|
|
|
+ goto restart;
|
|
|
|
+ }
|
|
|
|
+
|
update submodule and patches to 7.1.0
Notable changes:
* The only big change is the switch to using a custom QIOChannel for
savevm-async, because the previously used QEMUFileOps was dropped.
Changes to the current implementation:
* Switch to vector based methods as required for an IO channel. For
short reads the passed-in IO vector is stuffed with zeroes at the
end, just to be sure.
* For reading: The documentation in include/io/channel.h states that
at least one byte should be read, so also error out when we are
at the very end instead of returning 0.
* For reading: Fix off-by-one error when request goes beyond end.
The wrong code piece was:
if ((pos + size) > maxlen) {
size = maxlen - pos - 1;
}
Previously, the last byte would not be read. It's actually
possible to get a snapshot .raw file that has content all the way
up to the final 512 byte (= BDRV_SECTOR_SIZE) boundary without any
trailing zero bytes (I wrote a script to do it).
Luckily, it didn't cause a real issue, because qemu_loadvm_state()
is not interested in the final (i.e. QEMU_VM_VMDESCRIPTION)
section. The buffer for reading it is simply freed up afterwards
and the function will assume that it read the whole section, even
if that's not the case.
* For writing: Make use of the generated blk_pwritev() wrapper
instead of manually wrapping the coroutine to simplify and save a
few lines.
* Adapt to changed interfaces for blk_{pread,pwrite}:
* a9262f551e ("block: Change blk_{pread,pwrite}() param order")
* 3b35d4542c ("block: Add a 'flags' param to blk_pread()")
* bf5b16fa40 ("block: Make blk_{pread,pwrite}() return 0 on success")
Those changes especially affected the qemu-img dd patches, because
the context also changed, but also some of our block drivers used
the functions.
* Drop qemu-common.h include: it got renamed after essentially
everything was moved to other headers. The only remaining user I
could find for things dropped from the header between 7.0 and 7.1
was qemu_get_vm_name() in the iscsi-initiatorname patch, but it
already includes the header to which the function was moved.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
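A minimal sketch of the read clamp described in the notes above, assuming a request of `size` bytes at offset `pos` against a state file of length `maxlen`. The helper names are hypothetical and not taken from the patch. The old variant is kept only to show the lost byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: clamp a read of `size` bytes starting at `pos` so
 * that it never goes past `maxlen`. Returns the number of bytes that may
 * be read; 0 means the request starts at or beyond the end.
 */
static size_t clamp_read_size(int64_t pos, size_t size, int64_t maxlen)
{
    if (pos < 0 || pos >= maxlen) {
        return 0;
    }

    uint64_t remaining = (uint64_t)(maxlen - pos);

    return size > remaining ? (size_t)remaining : size;
}

/* The pre-fix clamp quoted in the notes, kept only for comparison. */
static size_t clamp_read_size_old(int64_t pos, size_t size, int64_t maxlen)
{
    if (pos + (int64_t)size > maxlen) {
        size = (size_t)(maxlen - pos - 1); /* off by one: drops the last byte */
    }
    return size;
}
```

With `maxlen = 512`, `pos = 500` and `size = 100`, the fixed clamp allows 12 bytes while the old one allowed only 11. A caller that follows the channel documentation would treat a result of 0 as an error instead of returning 0.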
2022-10-14 15:07:13 +03:00
|
|
|
+ QIOChannel *ioc = QIO_CHANNEL(qio_channel_savevm_async_new(snap_state.target,
|
|
|
|
+ &snap_state.bs_pos));
|
|
|
|
+ snap_state.file = qemu_file_new_output(ioc);
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
|
|
|
+ if (!snap_state.file) {
|
|
|
|
+ error_set(errp, ERROR_CLASS_GENERIC_ERROR, "failed to open '%s'", statefile);
|
|
|
|
+ goto restart;
|
|
|
|
+ }
|
|
|
|
+
|
2020-07-02 14:07:28 +03:00
|
|
|
+ /*
|
|
|
|
+ * qemu_savevm_* paths use migration code and expect a migration state.
|
|
|
|
+ * State is cleared in process_savevm_co, but has to be initialized
|
|
|
|
+ * here (blocking main thread, from QMP) to avoid race conditions.
|
|
|
|
+ */
|
update submodule and patches to QEMU 8.2.2
2024-04-25 18:21:28 +03:00
|
|
|
+ if (migrate_init(ms, errp)) {
|
|
|
|
+ return;
|
|
|
|
+ }
|
2023-10-17 15:10:09 +03:00
|
|
|
+ memset(&mig_stats, 0, sizeof(mig_stats));
|
2020-07-02 14:07:28 +03:00
|
|
|
+ ms->to_dst_file = snap_state.file;
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
|
|
|
+ error_setg(&snap_state.blocker, "block device is in use by savevm");
|
2017-08-07 10:10:07 +03:00
|
|
|
+ blk_op_block_all(snap_state.target, snap_state.blocker);
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
2019-04-19 10:53:37 +03:00
|
|
|
+ snap_state.state = SAVE_STATE_ACTIVE;
|
2020-07-02 14:07:28 +03:00
|
|
|
+ snap_state.finalize_bh = qemu_bh_new(process_savevm_finalize, &snap_state);
|
|
|
|
+ snap_state.co = qemu_coroutine_create(&process_savevm_co, NULL);
|
|
|
|
+ qemu_savevm_state_header(snap_state.file);
|
|
|
|
+ qemu_savevm_state_setup(snap_state.file);
|
|
|
|
+
|
|
|
|
+ /* Async processing from here on out happens in iohandler context, so let
|
|
|
|
+ * the target bdrv have its home there.
|
|
|
|
+ */
|
|
|
|
+ blk_set_aio_context(snap_state.target, iohandler_ctx, &local_err);
|
|
|
|
+
|
|
|
|
+ aio_co_schedule(iohandler_ctx, snap_state.co);
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
|
|
|
+ return;
|
|
|
|
+
|
|
|
|
+restart:
|
|
|
|
+
|
|
|
|
+ save_snapshot_error("setup failed");
|
|
|
|
+
|
|
|
|
+ if (snap_state.saved_vm_running) {
|
|
|
|
+ vm_start();
|
2021-02-11 19:11:11 +03:00
|
|
|
+ snap_state.saved_vm_running = false;
|
2017-04-05 11:49:19 +03:00
|
|
|
+ }
|
|
|
|
+}
|
|
|
|
+
|
2021-02-11 19:11:11 +03:00
|
|
|
+void coroutine_fn qmp_savevm_end(Error **errp)
|
2017-04-05 11:49:19 +03:00
|
|
|
+{
|
2021-02-11 19:11:11 +03:00
|
|
|
+ int64_t timeout;
|
|
|
|
+
|
2017-04-05 11:49:19 +03:00
|
|
|
+ if (snap_state.state == SAVE_STATE_DONE) {
|
|
|
|
+ error_set(errp, ERROR_CLASS_GENERIC_ERROR,
|
|
|
|
+ "VM snapshot not started");
|
|
|
|
+ return;
|
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ if (snap_state.state == SAVE_STATE_ACTIVE) {
|
|
|
|
+ snap_state.state = SAVE_STATE_CANCELLED;
|
2021-02-11 19:11:11 +03:00
|
|
|
+ goto wait_for_close;
|
2017-04-05 11:49:19 +03:00
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ if (snap_state.saved_vm_running) {
|
|
|
|
+ vm_start();
|
2021-02-11 19:11:11 +03:00
|
|
|
+ snap_state.saved_vm_running = false;
|
2017-04-05 11:49:19 +03:00
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ snap_state.state = SAVE_STATE_DONE;
|
2021-02-11 19:11:11 +03:00
|
|
|
+
|
|
|
|
+wait_for_close:
|
|
|
|
+ if (!snap_state.target) {
|
|
|
|
+ DPRINTF("savevm-end: no target file open\n");
|
|
|
|
+ return;
|
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ /* wait until cleanup is done before returning, this ensures that after this
|
|
|
|
+ * call exits the statefile will be closed and can be removed immediately */
|
|
|
|
+ DPRINTF("savevm-end: waiting for cleanup\n");
|
|
|
|
+ timeout = 30L * 1000 * 1000 * 1000;
|
2022-08-18 14:44:16 +03:00
|
|
|
+ qemu_co_sleep_ns_wakeable(&snap_state.target_close_wait,
|
2021-10-11 14:55:34 +03:00
|
|
|
+ QEMU_CLOCK_REALTIME, timeout);
|
2021-02-11 19:11:11 +03:00
|
|
|
+ if (snap_state.target) {
|
|
|
|
+ save_snapshot_error("timeout waiting for target file close in "
|
|
|
|
+ "qmp_savevm_end");
|
|
|
|
+ /* we cannot assume the snapshot finished in this case, so leave the
|
|
|
|
+ * state alone - caller has to figure something out */
|
|
|
|
+ return;
|
|
|
|
+ }
|
|
|
|
+
|
2022-08-18 14:44:17 +03:00
|
|
|
+ /* File closed and no other error, so ensure next snapshot can be started. */
|
|
|
|
+ if (snap_state.state != SAVE_STATE_ERROR) {
|
|
|
|
+ snap_state.state = SAVE_STATE_DONE;
|
|
|
|
+ }
|
|
|
|
+
|
2021-02-11 19:11:11 +03:00
|
|
|
+ DPRINTF("savevm-end: cleanup done\n");
|
2017-04-05 11:49:19 +03:00
|
|
|
+}
|
|
|
|
+
|
2018-02-22 14:34:57 +03:00
|
|
|
+int load_snapshot_from_blockdev(const char *filename, Error **errp)
|
2017-04-05 11:49:19 +03:00
|
|
|
+{
|
2017-08-07 10:10:07 +03:00
|
|
|
+ BlockBackend *be;
|
2017-04-05 11:49:19 +03:00
|
|
|
+ Error *local_err = NULL;
|
|
|
|
+ Error *blocker = NULL;
|
|
|
|
+
|
|
|
|
+ QEMUFile *f;
|
update submodule and patches to 7.1.0
Notable changes:
* The only big change is the switch to using a custom QIOChannel for
savevm-async, because the previously used QEMUFileOps was dropped.
Changes to the current implementation:
* Switch to vector based methods as required for an IO channel. For
short reads the passed-in IO vector is stuffed with zeroes at the
end, just to be sure.
* For reading: The documentation in include/io/channel.h states that
at least one byte should be read, so also error out when we are
at the very end instead of returning 0.
* For reading: Fix off-by-one error when request goes beyond end.
The wrong code piece was:
if ((pos + size) > maxlen) {
size = maxlen - pos - 1;
}
Previously, the last byte would not be read. It's actually
possible to get a snapshot .raw file that has content all the way
up to the final 512-byte (= BDRV_SECTOR_SIZE) boundary without any
trailing zero bytes (I wrote a script to do it).
Luckily, it didn't cause a real issue, because qemu_loadvm_state()
is not interested in the final (i.e. QEMU_VM_VMDESCRIPTION)
section. The buffer for reading it is simply freed up afterwards
and the function will assume that it read the whole section, even
if that's not the case.
* For writing: Make use of the generated blk_pwritev() wrapper
instead of manually wrapping the coroutine to simplify and save a
few lines.
* Adapt to changed interfaces for blk_{pread,pwrite}:
* a9262f551e ("block: Change blk_{pread,pwrite}() param order")
* 3b35d4542c ("block: Add a 'flags' param to blk_pread()")
* bf5b16fa40 ("block: Make blk_{pread,pwrite}() return 0 on success")
Those changes especially affected the qemu-img dd patches, because
the context also changed, but also some of our block drivers used
the functions.
* Drop qemu-common.h include: it got renamed after essentially
everything was moved to other headers. The only remaining user I
could find for things dropped from the header between 7.0 and 7.1
was qemu_get_vm_name() in the iscsi-initiatorname patch, but it
already includes the header to which the function was moved.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
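The off-by-one fix described in the message above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the patch: the helper name clamp_read_size() and its self-check are hypothetical, and only the variable names pos, size and maxlen are taken from the quoted snippet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): clamp a read request so it
 * never goes past the end of the device.
 *
 * pos    - current read offset in bytes
 * size   - number of bytes requested
 * maxlen - total length of the device in bytes
 *
 * Returns the number of bytes that may be read, or -1 when pos is already at
 * or beyond the end, following the "error out instead of returning 0" rule.
 */
static int64_t clamp_read_size(size_t pos, size_t size, size_t maxlen)
{
    if (pos >= maxlen) {
        return -1;
    }
    if (size > maxlen - pos) {
        /* The old code used maxlen - pos - 1 here and lost the last byte. */
        size = maxlen - pos;
    }
    return (int64_t)size;
}

/* Small self-check showing the boundary cases. */
static void clamp_read_size_selfcheck(void)
{
    assert(clamp_read_size(1000, 512, 1024) == 24); /* old code: 23 */
    assert(clamp_read_size(1023, 512, 1024) == 1);  /* final byte is read */
    assert(clamp_read_size(1024, 512, 1024) == -1); /* at the very end */
}
```

Comparing size against maxlen - pos, after checking pos < maxlen, also avoids the unsigned overflow that pos + size could hit for very large requests.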
2022-10-14 15:07:13 +03:00
|
|
|
+ size_t bs_pos = 0;
|
2017-08-07 10:10:07 +03:00
|
|
|
+ int ret = -EINVAL;
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
2017-08-07 10:10:07 +03:00
|
|
|
+ be = blk_new_open(filename, NULL, NULL, 0, &local_err);
|
2017-04-05 11:49:19 +03:00
|
|
|
+
|
2017-08-07 10:10:07 +03:00
|
|
|
+ if (!be) {
|
2018-02-22 14:34:57 +03:00
|
|
|
+ error_setg(errp, "Could not open VM state file");
|
2017-04-05 11:49:19 +03:00
|
|
|
+ goto the_end;
|
|
|
|
+ }
|
|
|
|
+
|
2017-08-07 10:10:07 +03:00
|
|
|
+ error_setg(&blocker, "block device is in use by load state");
|
|
|
|
+ blk_op_block_all(be, blocker);
|
|
|
|
+
|
2017-04-05 11:49:19 +03:00
|
|
|
+ /* restore the VM state */
|
2022-10-14 15:07:13 +03:00
|
|
|
+ f = qemu_file_new_input(QIO_CHANNEL(qio_channel_savevm_async_new(be, &bs_pos)));
|
2017-04-05 11:49:19 +03:00
|
|
|
+ if (!f) {
|
2018-02-22 14:34:57 +03:00
|
|
|
+ error_setg(errp, "Could not open VM state file");
|
2017-04-05 11:49:19 +03:00
|
|
|
+ goto the_end;
|
|
|
|
+ }
|
|
|
|
+
|
2018-02-22 14:34:57 +03:00
|
|
|
+ qemu_system_reset(SHUTDOWN_CAUSE_NONE);
|
2017-04-05 11:49:19 +03:00
|
|
|
+ ret = qemu_loadvm_state(f);
|
|
|
|
+
|
2021-03-16 19:30:22 +03:00
|
|
|
+ /* dirty bitmap migration has a special case we need to trigger manually */
|
|
|
|
+ dirty_bitmap_mig_before_vm_start();
|
|
|
|
+
|
2017-04-05 11:49:19 +03:00
|
|
|
+ qemu_fclose(f);
|
squash related patches
where there is no good reason to keep them separate. It's a pain
during rebase if there are multiple patches changing the same code
over and over again. This was especially bad for the backup-related
patches. If the history of patches really is needed, it can be
extracted via git. Additionally, compilation with partial application
of patches has been broken for a long time, because one of the master key
changes became part of an earlier patch during a past rebase.
If only the same files were changed by a subsequent patch and the
changes felt to belong together (obvious for later bug fixes, but also
done for features e.g. adding master key support for PBS), the patches
were squashed together.
The PBS namespace support patch was split into the individual parts
it changes, i.e. PBS block driver, pbs-restore binary and QMP backup
infrastructure, and squashed into the respective patches.
No code change is intended, git diff in the submodule should not show
any difference between applying all patches before this commit and
applying all patches after this commit.
The query-proxmox-support QMP function has been left as part of the
"PVE-Backup: Proxmox backup patches for QEMU" patch, because it's
currently only used there. If it ever is used elsewhere too, it can
be split out from there.
The recent alloc-track and BQL-related savevm-async changes have been
left separate for now, because it's not 100% clear they are the best
approach yet. This depends on what upstream decides about the BQL
stuff and whether and what kind of issues with the changes pop up.
The qemu-img dd snapshot patch has been re-ordered to after the other
qemu-img dd patches.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-05-15 16:39:56 +03:00
|
|
|
+
|
|
|
|
+ /* state_destroy assumes a real migration which would have added a yank */
|
|
|
|
+ yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort);
|
|
|
|
+
|
2017-04-05 11:49:19 +03:00
|
|
|
+ migration_incoming_state_destroy();
|
|
|
|
+ if (ret < 0) {
|
2018-02-22 14:34:57 +03:00
|
|
|
+ error_setg_errno(errp, -ret, "Error while loading VM state");
|
2017-04-05 11:49:19 +03:00
|
|
|
+ goto the_end;
|
|
|
|
+ }
|
|
|
|
+
|
|
|
|
+ ret = 0;
|
|
|
|
+
|
|
|
|
+ the_end:
|
2017-08-07 10:10:07 +03:00
|
|
|
+ if (be) {
|
|
|
|
+ blk_op_unblock_all(be, blocker);
|
2017-04-05 11:49:19 +03:00
|
|
|
+ error_free(blocker);
|
2017-08-07 10:10:07 +03:00
|
|
|
+ blk_unref(be);
|
2017-04-05 11:49:19 +03:00
|
|
|
+ }
|
|
|
|
+ return ret;
|
|
|
|
+}
|
2021-02-11 19:11:11 +03:00
|
|
|
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
|
update submodule and patches to QEMU 8.2.2
This version includes both the AioContext lock and the block graph
lock, so there might be some deadlocks lurking. It's not possible to
disable the block graph lock like was done in QEMU 8.1, because there
are no changes like the function bdrv_schedule_unref() that require
it. QEMU 9.0 will finally get rid of the AioContext locking.
During live-restore with a VirtIO SCSI drive with iothread there is a
known racy deadlock related to the AioContext lock. Not new [1], but
not sure if more likely now. Should be fixed in QEMU 9.0.
The block graph lock comes with annotations that can be checked by
clang's TSA. This required changes to the block drivers, i.e.
alloc-track, pbs, zeroinit as well as taking the appropriate locks
in pve-backup, savevm-async, vma-reader.
Local variable shadowing is prohibited via a compiler flag now,
required slight adaptation in vma.c.
Major changes only affect alloc-track:
* It is not possible to call a generated co-wrapper like
bdrv_get_info() while holding the block graph lock exclusively [0],
which does happen during initialization of alloc-track when the
backing hd is set and the refresh_limits driver callback is invoked.
The bdrv_get_info() call to get the cluster size is moved to
directly after opening the file child in track_open().
The important thing is that at least the request alignment for the
write target is used, because then the RMW cycle in bdrv_pwritev
will gather enough data from the backing file. Partial cluster
allocations in the target are not a fundamental issue, because the
driver returns its allocation status based on the bitmap, so any
other data that maps to the same cluster will still be copied later
by a stream job (or during writes to that cluster).
* Replacing the node cannot be done in the
track_co_change_backing_file() callback, because it is a coroutine
and cannot hold the block graph lock exclusively. So it is moved to
the stream job itself with the auto-remove option not having an
effect anymore (qemu-server would always set it anyways).
In the future, there could either be a special option for the stream
job, or maybe the upcoming blockdev-replace QMP command can be used.
Replacing the backing child is actually already done in the stream
job, so no need to do it in the track_co_change_backing_file()
callback. It also cannot be called from a coroutine. Looking at the
implementation in the qcow2 driver, it doesn't seem to be intended
to change the backing child itself, just update driver-internal
state.
Other changes:
* alloc-track: Error out early when used without auto-remove. Since
replacing the node now happens in the stream job, where the option
cannot be read from (it's internal to the driver), it will always be
treated as 'on'. Makes sure to have users beside qemu-server notice
the change (should they even exist). The option can be fully dropped
in the future while adding a version guard in qemu-server.
* alloc-track: Avoid seemingly superfluous child permission update.
Doesn't seem necessary nowadays (maybe after commit "alloc-track:
fix deadlock during drop" where the dropping is not rescheduled and
delayed anymore or some upstream change). Replacing the block node
will already update the permissions of the new node (which was the
file child before). Should there really be some issue, instead of
having a drop state, this could also be just based off the fact
whether there is still a backing child.
Dumping the cumulative (shared) permissions for the BDS with a debug
print yields the same values after this patch and with QEMU 8.1,
namely 3 and 5.
* PBS block driver: compile unconditionally. Proxmox VE always needs
it and something in the build process changed to make it not enabled
by default. Probably would need to move the build option to meson
otherwise.
* backup: job unreferencing during cleanup needs to happen outside of
coroutine, so it was moved to before invoking the clean
* mirror: Cherry-pick stable fix to avoid potential deadlock.
* savevm-async: migrate_init now can fail, so propagate potential
error.
* savevm-async: compression counters are not accessible outside
migration/ram-compress now, so drop code that prophylactically set
it to zero.
[0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
|
|
|
index 871898ac46..ef4634e5c1 100644
|
2021-02-11 19:11:11 +03:00
|
|
|
--- a/monitor/hmp-cmds.c
|
|
|
|
+++ b/monitor/hmp-cmds.c
|
update submodule and patches to QEMU 8.0.0
Many changes were necessary this time around:
* QAPI was changed to avoid redundant has_* variables, see commit
44ea9d9be3 ("qapi: Start to elide redundant has_FOO in generated C")
for details. This affected many QMP commands added by Proxmox too.
* Pending querying for migration got split into two functions, one to
estimate, one for exact value, see commit c8df4a7aef ("migration:
Split save_live_pending() into state_pending_*") for details. Relevant
for savevm-async and PBS dirty bitmap.
* Some block (driver) functions got converted to coroutines, so the
Proxmox block drivers needed to be adapted.
* Alloc track auto-detaching during PBS live restore got broken by
AioContext-related changes resulting in a deadlock. The current, hacky
method was replaced by a simpler one. Stefan apparently ran into a
problem with that when he wrote the driver, but there were
improvements in the stream job code since then and I didn't manage to
reproduce the issue. It's a separate patch "alloc-track: fix deadlock
during drop" for now, you can find the details there.
* Async snapshot-related changes:
- The pending querying got adapted to the above-mentioned split and
a patch is added to optimize it/make it more similar to what
upstream code does.
- Added initialization of the compression counters (for
future-proofing).
- It's necessary to hold the BQL (big QEMU lock = iothread mutex)
during the setup phase, because block layer functions are used there
and not doing so leads to racy, hard-to-debug crashes or hangs. It's
necessary to change some upstream code too for this, a version of
the patch "migration: for snapshots, hold the BQL during setup
callbacks" is intended to be upstreamed.
- Need to take the bdrv graph read lock before flushing.
* hmp_info_balloon was moved to a different file.
* Needed to include new headers from time to time to still get the
correct functions.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
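The has_* change mentioned in the message above can be illustrated with a small sketch. After the referenced QAPI commit, optional members of pointer type are tested against NULL, while optional scalar members keep their has_FOO flag. The struct ExampleInfo and the function describe_info() below are hypothetical stand-ins for illustration, not the generated SaveVMInfo type or any QEMU function.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a QAPI-generated struct, for illustration only. */
typedef struct ExampleInfo {
    const char *status; /* optional pointer member: NULL means "absent" */
    bool has_bytes;     /* optional scalar member: still needs a flag */
    uint64_t bytes;
} ExampleInfo;

/*
 * Format the optional members into buf.
 * Returns the length snprintf() reports for the formatted string.
 */
static int describe_info(const ExampleInfo *info, char *buf, size_t len)
{
    assert(info != NULL && buf != NULL);

    /* Pointer member: no has_status flag, just a NULL check. */
    const char *status = info->status ? info->status : "not running";

    /* Scalar member: 0 is a valid value, so the has_bytes flag remains. */
    if (info->has_bytes) {
        return snprintf(buf, len, "status=%s bytes=%" PRIu64,
                        status, info->bytes);
    }
    return snprintf(buf, len, "status=%s", status);
}
```

A scalar cannot use a sentinel such as NULL without reserving a value, which is why the flag is kept for members like a byte count.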
2023-05-15 16:39:53 +03:00
|
|
|
@@ -22,6 +22,7 @@
|
|
|
|
#include "monitor/monitor-internal.h"
|
|
|
|
#include "qapi/error.h"
|
|
|
|
#include "qapi/qapi-commands-control.h"
|
|
|
|
+#include "qapi/qapi-commands-migration.h"
|
|
|
|
#include "qapi/qapi-commands-misc.h"
|
|
|
|
#include "qapi/qmp/qdict.h"
|
2024-04-25 18:21:28 +03:00
|
|
|
#include "qemu/cutils.h"
|
2023-05-24 16:56:53 +03:00
|
|
|
@@ -443,3 +444,40 @@ void hmp_info_mtree(Monitor *mon, const QDict *qdict)
|
2021-02-11 19:11:11 +03:00
|
|
|
|
2023-05-15 16:39:53 +03:00
|
|
|
mtree_info(flatview, dispatch_tree, owner, disabled);
|
|
|
|
}
|
|
|
|
+
|
2021-02-11 19:11:11 +03:00
|
|
|
+void hmp_savevm_start(Monitor *mon, const QDict *qdict)
|
|
|
|
+{
|
|
|
|
+ Error *errp = NULL;
|
|
|
|
+ const char *statefile = qdict_get_try_str(qdict, "statefile");
|
|
|
|
+
|
2023-05-15 16:39:53 +03:00
|
|
|
+ qmp_savevm_start(statefile, &errp);
|
2021-02-11 19:11:11 +03:00
|
|
|
+ hmp_handle_error(mon, errp);
|
|
|
|
+}
|
|
|
|
+
|
|
|
|
+void coroutine_fn hmp_savevm_end(Monitor *mon, const QDict *qdict)
|
|
|
|
+{
|
|
|
|
+ Error *errp = NULL;
|
|
|
|
+
|
|
|
|
+ qmp_savevm_end(&errp);
|
|
|
|
+ hmp_handle_error(mon, errp);
|
|
|
|
+}
|
|
|
|
+
|
|
|
|
+void hmp_info_savevm(Monitor *mon, const QDict *qdict)
|
|
|
|
+{
|
|
|
|
+ SaveVMInfo *info;
|
|
|
|
+ info = qmp_query_savevm(NULL);
|
|
|
|
+
|
2023-05-15 16:39:53 +03:00
|
|
|
+ if (info->status) {
|
2021-02-11 19:11:11 +03:00
|
|
|
+ monitor_printf(mon, "savevm status: %s\n", info->status);
|
|
|
|
+ monitor_printf(mon, "total time: %" PRIu64 " milliseconds\n",
|
|
|
|
+ info->total_time);
|
|
|
|
+ } else {
|
|
|
|
+ monitor_printf(mon, "savevm status: not running\n");
|
|
|
|
+ }
|
|
|
|
+ if (info->has_bytes) {
|
|
|
|
+ monitor_printf(mon, "Bytes saved: %"PRIu64"\n", info->bytes);
|
|
|
|
+ }
|
2023-05-15 16:39:53 +03:00
|
|
|
+ if (info->error) {
|
2021-02-11 19:11:11 +03:00
|
|
|
+ monitor_printf(mon, "Error: %s\n", info->error);
|
|
|
|
+ }
|
|
|
|
+}
|
|
|
|
diff --git a/qapi/migration.json b/qapi/migration.json
|
index 197d3faa43..b41465fbe9 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
update submodule and patches to QEMU 8.2.2
This version includes both the AioContext lock and the block graph
lock, so there might be some deadlocks lurking. It's not possible to
disable the block graph lock like was done in QEMU 8.1, because there
are no changes like the function bdrv_schedule_unref() that require
it. QEMU 9.0 will finally get rid of the AioContext locking.
During live-restore with a VirtIO SCSI drive with iothread there is a
known racy deadlock related to the AioContext lock. Not new [1], but
not sure if more likely now. Should be fixed in QEMU 9.0.
The block graph lock comes with annotations that can be checked by
clang's TSA. This required changes to the block drivers, i.e.
alloc-track, pbs, zeroinit as well as taking the appropriate locks
in pve-backup, savevm-async, vma-reader.
Local variable shadowing is prohibited via a compiler flag now,
required slight adaptation in vma.c.
Major changes only affect alloc-track:
* It is not possible to call a generated co-wrapper like
bdrv_get_info() while holding the block graph lock exclusively [0],
which does happen during initialization of alloc-track when the
backing hd is set and the refresh_limits driver callback is invoked.
The bdrv_get_info() call to get the cluster size is moved to
directly after opening the file child in track_open().
The important thing is that at least the request alignment for the
write target is used, because then the RMW cycle in bdrv_pwritev
will gather enough data from the backing file. Partial cluster
allocations in the target are not a fundamental issue, because the
driver returns its allocation status based on the bitmap, so any
other data that maps to the same cluster will still be copied later
by a stream job (or during writes to that cluster).
* Replacing the node cannot be done in the
track_co_change_backing_file() callback, because it is a coroutine
and cannot hold the block graph lock exclusively. So it is moved to
the stream job itself with the auto-remove option not having an
effect anymore (qemu-server would always set it anyways).
In the future, there could either be a special option for the stream
job, or maybe the upcoming blockdev-replace QMP command can be used.
Replacing the backing child is actually already done in the stream
job, so no need to do it in the track_co_change_backing_file()
callback. It also cannot be called from a coroutine. Looking at the
implementation in the qcow2 driver, it doesn't seem to be intended
to change the backing child itself, just update driver-internal
state.
Other changes:
* alloc-track: Error out early when used without auto-remove. Since
replacing the node now happens in the stream job, where the option
cannot be read from (it's internal to the driver), it will always be
treated as 'on'. Makes sure to have users beside qemu-server notice
the change (should they even exist). The option can be fully dropped
in the future while adding a version guard in qemu-server.
* alloc-track: Avoid seemingly superfluous child permission update.
Doesn't seem necessary nowadays (maybe after commit "alloc-track:
fix deadlock during drop" where the dropping is not rescheduled and
delayed anymore or some upstream change). Replacing the block node
will already update the permissions of the new node (which was the
file child before). Should there really be some issue, instead of
having a drop state, this could also be just based off the fact
whether there is still a backing child.
Dumping the cumulative (shared) permissions for the BDS with a debug
print yields the same values after this patch and with QEMU 8.1,
namely 3 and 5.
* PBS block driver: compile unconditionally. Proxmox VE always needs
it and something in the build process changed to make it not enabled
by default. Probably would need to move the build option to meson
otherwise.
* backup: job unreferencing during cleanup needs to happen outside of
coroutine, so it was moved to before invoking the clean
* mirror: Cherry-pick stable fix to avoid potential deadlock.
* savevm-async: migrate_init now can fail, so propagate potential
error.
* savevm-async: compression counters are not accessible outside
migration/ram-compress now, so drop code that prophylactically set
it to zero.
[0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
@@ -298,6 +298,40 @@
            '*dirty-limit-throttle-time-per-round': 'uint64',
            '*dirty-limit-ring-full-time': 'uint64'} }
 
+##
+# @SaveVMInfo:
+#
+# Information about current migration process.
+#
+# @status: string describing the current savevm status.
+#          This can be 'active', 'completed', 'failed'.
+#          If this field is not returned, no savevm process
+#          has been initiated
+#
+# @error: string containing error message if status is failed.
+#
+# @total-time: total amount of milliseconds since savevm started.
+#              If savevm has ended, it returns the total save time
+#
+# @bytes: total amount of data transferred
+#
+# Since: 1.3
+##
+{ 'struct': 'SaveVMInfo',
+  'data': {'*status': 'str', '*error': 'str',
+           '*total-time': 'int', '*bytes': 'int'} }
+
+##
+# @query-savevm:
+#
+# Returns information about current savevm process.
+#
+# Returns: @SaveVMInfo
+#
+# Since: 1.3
+##
+{ 'command': 'query-savevm', 'returns': 'SaveVMInfo' }
+
 ##
 # @query-migrate:
 #
diff --git a/qapi/misc.json b/qapi/misc.json
index cda2effa81..94a58bb0bf 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -456,6 +456,22 @@
 ##
 { 'command': 'query-fdsets', 'returns': ['FdsetInfo'] }
 
+##
+# @savevm-start:
+#
+# Prepare for snapshot and halt VM. Save VM state to statefile.
+#
+##
+{ 'command': 'savevm-start', 'data': { '*statefile': 'str' } }
+
+##
+# @savevm-end:
+#
+# Resume VM after a snapshot.
+#
+##
+{ 'command': 'savevm-end', 'coroutine': true }
+
 ##
 # @CommandLineParameterType:
 #
diff --git a/qemu-options.hx b/qemu-options.hx
index b6b4ad9e67..881b0b3c43 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4590,6 +4590,18 @@ SRST
     Start right away with a saved state (``loadvm`` in monitor)
 ERST
 
+DEF("loadstate", HAS_ARG, QEMU_OPTION_loadstate, \
+    "-loadstate file\n" \
+    " start right away with a saved state\n",
+    QEMU_ARCH_ALL)
+SRST
+``-loadstate file``
+  Start right away with a saved state. This option does not roll back
+  disk state like @code{loadvm}, so the user must make sure that the
+  disks have the correct state. @var{file} can be any valid device URL.
+  See the section for "Device URL Syntax" for more information.
+ERST
+
 #ifndef _WIN32
 DEF("daemonize", 0, QEMU_OPTION_daemonize, \
     "-daemonize daemonize QEMU after initializing\n", QEMU_ARCH_ALL)
diff --git a/system/vl.c b/system/vl.c
index d2a3b3f457..57f7ba0525 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -163,6 +163,7 @@ static const char *accelerators;
update submodule and patches to 7.1.0
Notable changes:
* The only big change is the switch to using a custom QIOChannel for
savevm-async, because the previously used QEMUFileOps was dropped.
Changes to the current implementation:
* Switch to vector based methods as required for an IO channel. For
short reads the passed-in IO vector is stuffed with zeroes at the
end, just to be sure.
* For reading: The documentation in include/io/channel.h states that
at least one byte should be read, so also error out when we are
at the very end instead of returning 0.
* For reading: Fix off-by-one error when request goes beyond end.
The wrong code piece was:
if ((pos + size) > maxlen) {
size = maxlen - pos - 1;
}
Previously, the last byte would not be read. It's actually
possible to get a snapshot .raw file that has content all the way
up the final 512 byte (= BDRV_SECTOR_SIZE) boundary without any
trailing zero bytes (I wrote a script to do it).
Luckily, it didn't cause a real issue, because qemu_loadvm_state()
is not interested in the final (i.e. QEMU_VM_VMDESCRIPTION)
section. The buffer for reading it is simply freed up afterwards
and the function will assume that it read the whole section, even
if that's not the case.
* For writing: Make use of the generated blk_pwritev() wrapper
instead of manually wrapping the coroutine to simplify and save a
few lines.
* Adapt to changed interfaces for blk_{pread,pwrite}:
* a9262f551e ("block: Change blk_{pread,pwrite}() param order")
* 3b35d4542c ("block: Add a 'flags' param to blk_pread()")
* bf5b16fa40 ("block: Make blk_{pread,pwrite}() return 0 on success")
Those changes especially affected the qemu-img dd patches, because
the context also changed, but also some of our block drivers used
the functions.
* Drop qemu-common.h include: it got renamed after essentially
everything was moved to other headers. The only remaining user I
could find for things dropped from the header between 7.0 and 7.1
was qemu_get_vm_name() in the iscsi-initiatorname patch, but it
already includes the header to which the function was moved.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
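As a minimal sketch of the off-by-one fix described in the message above (this is not the actual savevm-async code; the helper names clamp_read_size() and clamp_read_size_old() are hypothetical and only illustrate the arithmetic), clamping a read that goes beyond the end could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: clamp a read of `size` bytes starting at `pos`
 * so that it does not go past `maxlen`. Returns 0 when nothing is left,
 * which a caller could turn into an error, since an IO channel read is
 * expected to deliver at least one byte.
 */
static size_t clamp_read_size(uint64_t pos, size_t size, uint64_t maxlen)
{
    if (pos >= maxlen) {
        return 0;
    }
    if (size > maxlen - pos) {
        /* All remaining bytes, including the very last one. */
        size = (size_t)(maxlen - pos);
    }
    return size;
}

/* The old clamp quoted above, kept for comparison: it drops the last byte. */
static size_t clamp_read_size_old(uint64_t pos, size_t size, uint64_t maxlen)
{
    if ((pos + size) > maxlen) {
        size = (size_t)(maxlen - pos - 1);
    }
    return size;
}
```

With pos = 1000, size = 100 and maxlen = 1024, the old variant yields 23 bytes while the corrected one yields all 24 remaining bytes.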
 static bool have_custom_ram_size;
 static const char *ram_memdev_id;
 static QDict *machine_opts_dict;
+static const char *loadstate;
 static QTAILQ_HEAD(, ObjectOption) object_opts = QTAILQ_HEAD_INITIALIZER(object_opts);
 static QTAILQ_HEAD(, DeviceOption) device_opts = QTAILQ_HEAD_INITIALIZER(device_opts);
 static int display_remote;
update submodule and patches to QEMU 8.2.2
This version includes both the AioContext lock and the block graph
lock, so there might be some deadlocks lurking. It's not possible to
disable the block graph lock like was done in QEMU 8.1, because there
are no changes like the function bdrv_schedule_unref() that require
it. QEMU 9.0 will finally get rid of the AioContext locking.
During live-restore with a VirtIO SCSI drive with iothread there is a
known racy deadlock related to the AioContext lock. Not new [1], but
not sure if more likely now. Should be fixed in QEMU 9.0.
The block graph lock comes with annotations that can be checked by
clang's TSA. This required changes to the block drivers, i.e.
alloc-track, pbs, zeroinit as well as taking the appropriate locks
in pve-backup, savevm-async, vma-reader.
Local variable shadowing is prohibited via a compiler flag now,
required slight adaptation in vma.c.
Major changes only affect alloc-track:
* It is not possible to call a generated co-wrapper like
bdrv_get_info() while holding the block graph lock exclusively [0],
which does happen during initialization of alloc-track when the
backing hd is set and the refresh_limits driver callback is invoked.
The bdrv_get_info() call to get the cluster size is moved to
directly after opening the file child in track_open().
The important thing is that at least the request alignment for the
write target is used, because then the RMW cycle in bdrv_pwritev
will gather enough data from the backing file. Partial cluster
allocations in the target are not a fundamental issue, because the
driver returns its allocation status based on the bitmap, so any
other data that maps to the same cluster will still be copied later
by a stream job (or during writes to that cluster).
* Replacing the node cannot be done in the
track_co_change_backing_file() callback, because it is a coroutine
and cannot hold the block graph lock exclusively. So it is moved to
the stream job itself with the auto-remove option not having an
effect anymore (qemu-server would always set it anyways).
In the future, there could either be a special option for the stream
job, or maybe the upcoming blockdev-replace QMP command can be used.
Replacing the backing child is actually already done in the stream
job, so no need to do it in the track_co_change_backing_file()
callback. It also cannot be called from a coroutine. Looking at the
implementation in the qcow2 driver, it doesn't seem to be intended
to change the backing child itself, just update driver-internal
state.
Other changes:
* alloc-track: Error out early when used without auto-remove. Since
replacing the node now happens in the stream job, where the option
cannot be read from (it's internal to the driver), it will always be
treated as 'on'. Makes sure to have users beside qemu-server notice
the change (should they even exist). The option can be fully dropped
in the future while adding a version guard in qemu-server.
* alloc-track: Avoid seemingly superfluous child permission update.
Doesn't seem necessary nowadays (maybe after commit "alloc-track:
fix deadlock during drop" where the dropping is not rescheduled and
delayed anymore or some upstream change). Replacing the block node
will already update the permissions of the new node (which was the
file child before). Should there really be some issue, instead of
having a drop state, this could also be just based off the fact
whether there is still a backing child.
Dumping the cumulative (shared) permissions for the BDS with a debug
print yields the same values after this patch and with QEMU 8.1,
namely 3 and 5.
* PBS block driver: compile unconditionally. Proxmox VE always needs
it and something in the build process changed to make it not enabled
by default. Probably would need to move the build option to meson
otherwise.
* backup: job unreferencing during cleanup needs to happen outside of
coroutine, so it was moved to before invoking the clean
* mirror: Cherry-pick stable fix to avoid potential deadlock.
* savevm-async: migrate_init now can fail, so propagate potential
error.
* savevm-async: compression counters are not accessible outside
migration/ram-compress now, so drop code that prophylactically set
them to zero.
[0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
@@ -2715,6 +2716,12 @@ void qmp_x_exit_preconfig(Error **errp)
 
     if (loadvm) {
         load_snapshot(loadvm, NULL, false, NULL, &error_fatal);
+    } else if (loadstate) {
+        Error *local_err = NULL;
+        if (load_snapshot_from_blockdev(loadstate, &local_err) < 0) {
+            error_report_err(local_err);
+            autostart = 0;
+        }
     }
     if (replay_mode != REPLAY_MODE_NONE) {
         replay_vmstate_init();
@@ -3265,6 +3272,9 @@ void qemu_init(int argc, char **argv)
             case QEMU_OPTION_loadvm:
                 loadvm = optarg;
                 break;
+            case QEMU_OPTION_loadstate:
+                loadstate = optarg;
+                break;
             case QEMU_OPTION_full_screen:
                 dpy.has_full_screen = true;
                 dpy.full_screen = true;