f1eed34ac7
This version includes both the AioContext lock and the block graph
lock, so there might be some deadlocks lurking. It's not possible to
disable the block graph lock like was done in QEMU 8.1, because there
are no changes like the function bdrv_schedule_unref() that require
it. QEMU 9.0 will finally get rid of the AioContext locking.

During live-restore with a VirtIO SCSI drive with iothread, there is a
known racy deadlock related to the AioContext lock. It is not new [1],
but it is unclear whether it has become more likely now. It should be
fixed in QEMU 9.0.

The block graph lock comes with annotations that can be checked by
clang's TSA. This required changes to the block drivers, i.e.
alloc-track, pbs and zeroinit, as well as taking the appropriate locks
in pve-backup, savevm-async and vma-reader.

Local variable shadowing is now prohibited via a compiler flag, which
required a slight adaptation in vma.c.

Major changes only affect alloc-track:

* It is not possible to call a generated co-wrapper like
  bdrv_get_info() while holding the block graph lock exclusively [0],
  which does happen during initialization of alloc-track when the
  backing hd is set and the refresh_limits driver callback is invoked.
  The bdrv_get_info() call to get the cluster size is moved to
  directly after opening the file child in track_open(). The important
  thing is that at least the request alignment for the write target is
  used, because then the RMW cycle in bdrv_pwritev will gather enough
  data from the backing file. Partial cluster allocations in the
  target are not a fundamental issue, because the driver returns its
  allocation status based on the bitmap, so any other data that maps
  to the same cluster will still be copied later by a stream job (or
  during writes to that cluster).

* Replacing the node cannot be done in the
  track_co_change_backing_file() callback, because it is a coroutine
  and cannot hold the block graph lock exclusively. So it is moved to
  the stream job itself, with the auto-remove option no longer having
  an effect (qemu-server would always set it anyway). In the future,
  there could either be a special option for the stream job, or maybe
  the upcoming blockdev-replace QMP command can be used. Replacing the
  backing child is actually already done in the stream job, so there
  is no need to do it in the track_co_change_backing_file() callback,
  which also cannot be called from a coroutine. Looking at the
  implementation in the qcow2 driver, the callback doesn't seem to be
  intended to change the backing child itself, just to update
  driver-internal state.

Other changes:

* alloc-track: Error out early when used without auto-remove. Since
  replacing the node now happens in the stream job, where the option
  cannot be read (it's internal to the driver), it will always be
  treated as 'on'. This makes sure users besides qemu-server notice
  the change (should they even exist). The option can be fully dropped
  in the future, together with adding a version guard in qemu-server.

* alloc-track: Avoid a seemingly superfluous child permission update.
  It doesn't seem necessary nowadays (maybe since commit "alloc-track:
  fix deadlock during drop", where the dropping is no longer
  rescheduled and delayed, or some upstream change). Replacing the
  block node will already update the permissions of the new node
  (which was the file child before). Should there really be some
  issue, instead of having a drop state, this could also be based on
  whether there is still a backing child. Dumping the cumulative
  (shared) permissions for the BDS with a debug print yields the same
  values after this patch as with QEMU 8.1, namely 3 and 5.

* PBS block driver: compile unconditionally. Proxmox VE always needs
  it and something in the build process changed so that it is no
  longer enabled by default. Otherwise, the build option would
  probably need to be moved to meson.

* backup: job unreferencing during cleanup needs to happen outside of
  coroutine, so it was moved to before invoking the clean

* mirror: Cherry-pick a stable fix to avoid a potential deadlock.

* savevm-async: migrate_init() can now fail, so propagate the
  potential error.

* savevm-async: compression counters are no longer accessible outside
  migration/ram-compress, so drop the code that prophylactically set
  them to zero.

[0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
339 lines
14 KiB
Diff
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Fiona Ebner <f.ebner@proxmox.com>
Date: Thu, 11 Apr 2024 11:29:28 +0200
Subject: [PATCH] PVE backup: add fleecing option

When a fleecing option is given, it is expected that each device has
a corresponding "-fleecing" block device already attached, except for
EFI disk and TPM state, where fleecing is never used.

The following graph was adapted from [0] which also contains more
details about fleecing.

[guest]
   |
   | root
   v                 file
[copy-before-write]<------[snapshot-access]
   |           |
   | file      | target
   v           v
[source]    [fleecing]
For fleecing, a copy-before-write filter is inserted on top of the
source node, as well as a snapshot-access node pointing to the filter
node, which allows reading the consistent state of the image at the
time it was inserted. New guest writes are passed through the
copy-before-write filter, which will first copy over old data to the
fleecing image in case that old data is still needed by the
snapshot-access node.

The backup process will sequentially read from the snapshot access,
which has a bitmap and knows whether to read from the original image
or the fleecing image to get the "snapshot" state, i.e. data from the
source image at the time when the copy-before-write filter was
inserted. After reading, the copied sections are discarded from the
fleecing image to reduce space usage.

All of this can be restricted by an initial dirty bitmap to parts of
the source image that are required for an incremental backup.

For discard to work, it is necessary that the fleecing image does not
have a larger cluster size than the backup job granularity. Since
querying that size does not always work, e.g. for RBD with krbd the
cluster size will not be reported, a minimum of 4 MiB is used. A job
with a PBS target already has at least this granularity, so it's only
relevant for other targets, i.e. edge cases where this minimum is not
enough should be very rare in practice. If ever necessary in the
future, a passed-in value for the backup QMP command could still be
added to override it.

Additionally, the cbw-timeout and on-cbw-error=break-snapshot options
are set when installing the copy-before-write filter and
snapshot-access. When an error or timeout occurs, the problematic (and
each further) snapshot operation will fail and thus cancel the backup
instead of breaking the guest write.

Note that job_id cannot be inferred from the snapshot-access bs because
it has no parent, so just pass the one from the original bs.

[0]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg876056.html

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
 block/monitor/block-hmp-cmds.c |   1 +
 pve-backup.c                   | 145 ++++++++++++++++++++++++++++++++-
 qapi/block-core.json           |   8 +-
 3 files changed, 150 insertions(+), 4 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 1656859e03..f6cc9e5cf7 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -1072,6 +1072,7 @@ void coroutine_fn hmp_backup(Monitor *mon, const QDict *qdict)
         NULL, NULL,
         devlist, qdict_haskey(qdict, "speed"), speed,
         false, 0, // BackupPerf max-workers
+        false, false, // fleecing
         &error);

     hmp_handle_error(mon, error);
diff --git a/pve-backup.c b/pve-backup.c
index 777db7938e..4c728951ac 100644
--- a/pve-backup.c
+++ b/pve-backup.c
@@ -7,9 +7,11 @@
 #include "sysemu/blockdev.h"
 #include "block/block_int-global-state.h"
 #include "block/blockjob.h"
+#include "block/copy-before-write.h"
 #include "block/dirty-bitmap.h"
 #include "block/graph-lock.h"
 #include "qapi/qapi-commands-block.h"
+#include "qapi/qmp/qdict.h"
 #include "qapi/qmp/qerror.h"
 #include "qemu/cutils.h"

@@ -81,8 +83,15 @@ static void pvebackup_init(void)
 // initialize PVEBackupState at startup
 opts_init(pvebackup_init);

+typedef struct PVEBackupFleecingInfo {
+    BlockDriverState *bs;
+    BlockDriverState *cbw;
+    BlockDriverState *snapshot_access;
+} PVEBackupFleecingInfo;
+
 typedef struct PVEBackupDevInfo {
     BlockDriverState *bs;
+    PVEBackupFleecingInfo fleecing;
     size_t size;
     uint64_t block_size;
     uint8_t dev_id;
@@ -355,6 +364,25 @@ static void pvebackup_complete_cb(void *opaque, int ret)
     PVEBackupDevInfo *di = opaque;
     di->completed_ret = ret;

+    /*
+     * Handle block-graph specific cleanup (for fleecing) outside of the coroutine, because the work
+     * won't be done as a coroutine anyways:
+     * - For snapshot_access, allows doing bdrv_unref() directly. Doing it via bdrv_co_unref() would
+     *   just spawn a BH calling bdrv_unref().
+     * - For cbw, draining would need to spawn a BH.
+     *
+     * Note that the AioContext lock is already acquired by our caller, i.e.
+     * job_finalize_single_locked()
+     */
+    if (di->fleecing.snapshot_access) {
+        bdrv_unref(di->fleecing.snapshot_access);
+        di->fleecing.snapshot_access = NULL;
+    }
+    if (di->fleecing.cbw) {
+        bdrv_cbw_drop(di->fleecing.cbw);
+        di->fleecing.cbw = NULL;
+    }
+
     /*
      * Needs to happen outside of coroutine, because it takes the graph write lock.
      */
@@ -525,9 +553,84 @@ static void create_backup_jobs_bh(void *opaque) {

         bdrv_drained_begin(di->bs);

+        BackupPerf perf = (BackupPerf){ .max_workers = backup_state.perf.max_workers };
+
+        BlockDriverState *source_bs = di->bs;
+        bool discard_source = false;
+        bdrv_graph_co_rdlock();
+        const char *job_id = bdrv_get_device_name(di->bs);
+        bdrv_graph_co_rdunlock();
+        if (di->fleecing.bs) {
+            QDict *cbw_opts = qdict_new();
+            qdict_put_str(cbw_opts, "driver", "copy-before-write");
+            qdict_put_str(cbw_opts, "file", bdrv_get_node_name(di->bs));
+            qdict_put_str(cbw_opts, "target", bdrv_get_node_name(di->fleecing.bs));
+
+            if (di->bitmap) {
+                /*
+                 * Only guest writes to parts relevant for the backup need to be intercepted with
+                 * old data being copied to the fleecing image.
+                 */
+                qdict_put_str(cbw_opts, "bitmap.node", bdrv_get_node_name(di->bs));
+                qdict_put_str(cbw_opts, "bitmap.name", bdrv_dirty_bitmap_name(di->bitmap));
+            }
+            /*
+             * Fleecing storage is supposed to be fast and it's better to break backup than guest
+             * writes. Certain guest drivers like VirtIO-win have 60 seconds timeout by default, so
+             * abort a bit before that.
+             */
+            qdict_put_str(cbw_opts, "on-cbw-error", "break-snapshot");
+            qdict_put_int(cbw_opts, "cbw-timeout", 45);
+
+            di->fleecing.cbw = bdrv_insert_node(di->bs, cbw_opts, BDRV_O_RDWR, &local_err);
+
+            if (!di->fleecing.cbw) {
+                error_setg(errp, "appending cbw node for fleecing failed: %s",
+                           local_err ? error_get_pretty(local_err) : "unknown error");
+                break;
+            }
+
+            QDict *snapshot_access_opts = qdict_new();
+            qdict_put_str(snapshot_access_opts, "driver", "snapshot-access");
+            qdict_put_str(snapshot_access_opts, "file", bdrv_get_node_name(di->fleecing.cbw));
+
+            /*
+             * Holding the AioContext lock here would cause a deadlock, because bdrv_open_driver()
+             * will aquire it a second time. But it's allowed to be held exactly once when polling
+             * and that happens when the bdrv_refresh_total_sectors() call is made there.
+             */
+            aio_context_release(aio_context);
+            di->fleecing.snapshot_access =
+                bdrv_open(NULL, NULL, snapshot_access_opts, BDRV_O_RDWR | BDRV_O_UNMAP, &local_err);
+            aio_context_acquire(aio_context);
+            if (!di->fleecing.snapshot_access) {
+                error_setg(errp, "setting up snapshot access for fleecing failed: %s",
+                           local_err ? error_get_pretty(local_err) : "unknown error");
+                break;
+            }
+            source_bs = di->fleecing.snapshot_access;
+            discard_source = true;
+
+            /*
+             * bdrv_get_info() just retuns 0 (= doesn't matter) for RBD when using krbd. But discard
+             * on the fleecing image won't work if the backup job's granularity is less than the RBD
+             * object size (default 4 MiB), so it does matter. Always use at least 4 MiB. With a PBS
+             * target, the backup job granularity would already be at least this much.
+             */
+            perf.min_cluster_size = 4 * 1024 * 1024;
+            /*
+             * For discard to work, cluster size for the backup job must be at least the same as for
+             * the fleecing image.
+             */
+            BlockDriverInfo bdi;
+            if (bdrv_get_info(di->fleecing.bs, &bdi) >= 0) {
+                perf.min_cluster_size = MAX(perf.min_cluster_size, bdi.cluster_size);
+            }
+        }
+
         BlockJob *job = backup_job_create(
-            NULL, di->bs, di->target, backup_state.speed, sync_mode, di->bitmap,
-            bitmap_mode, false, NULL, &backup_state.perf, BLOCKDEV_ON_ERROR_REPORT,
+            job_id, source_bs, di->target, backup_state.speed, sync_mode, di->bitmap,
+            bitmap_mode, false, discard_source, NULL, &perf, BLOCKDEV_ON_ERROR_REPORT,
             BLOCKDEV_ON_ERROR_REPORT, JOB_DEFAULT, pvebackup_complete_cb, di, backup_state.txn,
             &local_err);
@@ -585,6 +688,14 @@ static void create_backup_jobs_bh(void *opaque) {
     aio_co_enter(data->ctx, data->co);
 }

+/*
+ * EFI disk and TPM state are small and it's just not worth setting up fleecing for them.
+ */
+static bool device_uses_fleecing(const char *device_id)
+{
+    return strncmp(device_id, "drive-efidisk", 13) && strncmp(device_id, "drive-tpmstate", 14);
+}
+
 /*
  * Returns a list of device infos, which needs to be freed by the caller. In
  * case of an error, errp will be set, but the returned value might still be a
@@ -592,6 +703,7 @@ static void create_backup_jobs_bh(void *opaque) {
  */
 static GList coroutine_fn GRAPH_RDLOCK *get_device_info(
     const char *devlist,
+    bool fleecing,
     Error **errp)
 {
     gchar **devs = NULL;
@@ -615,6 +727,31 @@ static GList coroutine_fn GRAPH_RDLOCK *get_device_info(
         }
         PVEBackupDevInfo *di = g_new0(PVEBackupDevInfo, 1);
         di->bs = bs;
+
+        if (fleecing && device_uses_fleecing(*d)) {
+            g_autofree gchar *fleecing_devid = g_strconcat(*d, "-fleecing", NULL);
+            BlockBackend *fleecing_blk = blk_by_name(fleecing_devid);
+            if (!fleecing_blk) {
+                error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
+                          "Device '%s' not found", fleecing_devid);
+                goto err;
+            }
+            BlockDriverState *fleecing_bs = blk_bs(fleecing_blk);
+            if (!bdrv_co_is_inserted(fleecing_bs)) {
+                error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, fleecing_devid);
+                goto err;
+            }
+            /*
+             * Fleecing image needs to be the same size to act as a cbw target.
+             */
+            if (bs->total_sectors != fleecing_bs->total_sectors) {
+                error_setg(errp, "Size mismatch for '%s' - sector count %ld != %ld",
+                           fleecing_devid, fleecing_bs->total_sectors, bs->total_sectors);
+                goto err;
+            }
+            di->fleecing.bs = fleecing_bs;
+        }
+
         di_list = g_list_append(di_list, di);
         d++;
     }
@@ -664,6 +801,7 @@ UuidInfo coroutine_fn *qmp_backup(
     const char *devlist,
     bool has_speed, int64_t speed,
     bool has_max_workers, int64_t max_workers,
+    bool has_fleecing, bool fleecing,
     Error **errp)
 {
     assert(qemu_in_coroutine());
@@ -692,7 +830,7 @@ UuidInfo coroutine_fn *qmp_backup(
     format = has_format ? format : BACKUP_FORMAT_VMA;

     bdrv_graph_co_rdlock();
-    di_list = get_device_info(devlist, &local_err);
+    di_list = get_device_info(devlist, has_fleecing && fleecing, &local_err);
     bdrv_graph_co_rdunlock();
     if (local_err) {
         error_propagate(errp, local_err);
@@ -1100,5 +1238,6 @@ ProxmoxSupportStatus *qmp_query_proxmox_support(Error **errp)
     ret->query_bitmap_info = true;
     ret->pbs_masterkey = true;
     ret->backup_max_workers = true;
+    ret->backup_fleecing = true;
     return ret;
 }
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 48eec4ef29..1c036e488e 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -935,6 +935,10 @@
 #
 # @max-workers: see @BackupPerf for details. Default 16.
 #
+# @fleecing: perform a backup with fleecing. For each device in @devlist, a
+#     corresponing '-fleecing' device with the same size already needs to
+#     be present.
+#
 # Returns: the uuid of the backup job
 #
 ##
@@ -955,7 +959,8 @@
             '*firewall-file': 'str',
             '*devlist': 'str',
             '*speed': 'int',
-            '*max-workers': 'int' },
+            '*max-workers': 'int',
+            '*fleecing': 'bool' },
   'returns': 'UuidInfo', 'coroutine': true }

 ##
@@ -1011,6 +1016,7 @@
             'pbs-dirty-bitmap-migration': 'bool',
             'pbs-masterkey': 'bool',
             'pbs-library-version': 'str',
+            'backup-fleecing': 'bool',
             'backup-max-workers': 'bool' } }

 ##