From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Fiona Ebner <f.ebner@proxmox.com>
Date: Tue, 7 Mar 2023 15:03:02 +0100
Subject: [PATCH] ide: avoid potential deadlock when draining during trim
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The deadlock can happen as follows:
1. ide_issue_trim is called, and increments the in_flight counter.
2. ide_issue_trim_cb calls blk_aio_pdiscard.
3. Somebody else starts draining (e.g. backup to insert the cbw node).
4. ide_issue_trim_cb is called as the completion callback for
   blk_aio_pdiscard.
5. ide_issue_trim_cb issues yet another blk_aio_pdiscard request.
6. The request is added to the wait queue via blk_wait_while_drained,
   because draining has been started.
7. Nobody ever decrements the in_flight counter and draining can't
   finish. This would be done by ide_trim_bh_cb, which is called after
   ide_issue_trim_cb has issued its last request, but
   ide_issue_trim_cb is not called anymore, because it's the
   completion callback of blk_aio_pdiscard, which waits on draining.
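
The sequence above can be replayed with a toy model of the counter. This
is only a sketch and not QEMU code: ToyBlk and every toy_* function are
hypothetical names, and the model assumes that a request parked while a
drain is active holds no in-flight count of its own. It compares taking
the count up front for the whole trim with letting each discard carry
its own count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not QEMU code: every name here is hypothetical. */
typedef struct {
    int in_flight;       /* models the BlockBackend in-flight counter */
    bool draining;       /* true once somebody has started a drain */
    bool request_queued; /* a discard parked until the drain ends */
} ToyBlk;

/* Models issuing a discard: it is parked while a drain is active. */
static void toy_submit_discard(ToyBlk *blk)
{
    if (blk->draining) {
        blk->request_queued = true; /* its completion callback cannot run */
    } else {
        blk->in_flight++;           /* the request itself holds a count */
    }
}

/* A drain can only finish once no counts are held. */
static bool toy_drain_can_finish(const ToyBlk *blk)
{
    return blk->in_flight == 0;
}

/* Old scheme: one count is taken up front and held for the whole trim. */
bool toy_old_scheme_drain_finishes(void)
{
    ToyBlk blk = {0};

    blk.in_flight++;            /* step 1: count taken when the trim starts */
    toy_submit_discard(&blk);   /* step 2: first discard */
    blk.draining = true;        /* step 3: somebody starts draining */
    blk.in_flight--;            /* step 4: first discard completes */
    toy_submit_discard(&blk);   /* steps 5-6: next discard is parked */
    return toy_drain_can_finish(&blk); /* step 7: up-front count remains */
}

/* New scheme: no up-front count, only the discards hold counts. */
bool toy_new_scheme_drain_finishes(void)
{
    ToyBlk blk = {0};

    toy_submit_discard(&blk);   /* first discard holds the only count */
    blk.draining = true;        /* somebody starts draining */
    blk.in_flight--;            /* first discard completes */
    toy_submit_discard(&blk);   /* next discard is parked, holds no count */
    return toy_drain_can_finish(&blk);
}
```

In this model the old scheme leaves one count behind that only the
parked request's callback chain could release, while the new scheme
lets the drain finish and the parked request continue afterwards.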

Quoting Hanna Czenczek:
> The point of 7e5cdb345f was that we need any in-flight count to
> accompany a set s->bus->dma->aiocb. While blk_aio_pdiscard() is
> happening, we don’t necessarily need another count. But we do need
> it while there is no blk_aio_pdiscard().
> ide_issue_trim_cb() returns in two cases (and, recursively through
> its callers, leaves s->bus->dma->aiocb set):
> 1. After calling blk_aio_pdiscard(), which will keep an in-flight
>    count,
> 2. After calling replay_bh_schedule_event() (i.e.
>    qemu_bh_schedule()), which does not keep an in-flight count.

Thus, even after moving the blk_inc_in_flight to above the
replay_bh_schedule_event call, the invariant "ide_issue_trim_cb
returns with an accompanying in-flight count" is still satisfied.

However, the issue 7e5cdb345f fixed for canceling resurfaces, because
ide_cancel_dma_sync assumes that it just needs to drain once. But now
the in_flight count is not consistently > 0 during the trim operation.
So, change it to drain until !s->bus->dma->aiocb, which means that the
operation finished (s->bus->dma->aiocb is cleared by ide_set_inactive
via the ide_dma_cb when the end of the transfer is reached).
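
Why a single drain may no longer be enough can be illustrated with
another toy model. Again this is a sketch, not QEMU code: ToyTrim and
the toy_* functions are hypothetical, and the model simply assumes that
one drain pass completes the request currently in flight while the
follow-up request stays parked until that pass has returned.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not QEMU code: every name here is hypothetical. */
typedef struct {
    int remaining_ranges; /* trim ranges that still need a discard */
    bool aiocb_set;       /* models s->bus->dma->aiocb being non-NULL */
} ToyTrim;

/*
 * Models one drain pass. Assumption of this sketch: the pass finishes
 * the request in flight, but the next request is only issued after the
 * pass has returned, so work can be left over.
 */
void toy_drain_once(ToyTrim *t)
{
    if (t->remaining_ranges > 0) {
        t->remaining_ranges--;
    }
    if (t->remaining_ranges == 0) {
        t->aiocb_set = false; /* the whole operation has finished */
    }
}

/* Drains until the operation is finished; returns the number of passes. */
int toy_cancel_sync(ToyTrim *t)
{
    int passes = 0;

    while (t->aiocb_set) {
        toy_drain_once(t);
        passes++;
    }
    return passes;
}
```

With three ranges outstanding, the first pass returns with aiocb_set
still true, which is the situation a single drain followed by an
assertion would trip over. The loop keeps draining until the flag
clears.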

Discussion here:
https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg02506.html

Fixes: 7e5cdb345f ("ide: Increment BB in-flight counter for TRIM BH")
Suggested-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 hw/ide/core.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index e8cb2dac92..3b21acf651 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
update submodule and patches to QEMU 8.2.2
This version includes both the AioContext lock and the block graph
lock, so there might be some deadlocks lurking. It's not possible to
disable the block graph lock like was done in QEMU 8.1, because there
are no changes like the function bdrv_schedule_unref() that require
it. QEMU 9.0 will finally get rid of the AioContext locking.
During live-restore with a VirtIO SCSI drive with iothread there is a
known racy deadlock related to the AioContext lock. Not new [1], but
not sure if more likely now. Should be fixed in QEMU 9.0.
The block graph lock comes with annotations that can be checked by
clang's TSA. This required changes to the block drivers, i.e.
alloc-track, pbs, zeroinit as well as taking the appropriate locks
in pve-backup, savevm-async, vma-reader.
Local variable shadowing is prohibited via a compiler flag now,
required slight adaptation in vma.c.
Major changes only affect alloc-track:
* It is not possible to call a generated co-wrapper like
bdrv_get_info() while holding the block graph lock exclusively [0],
which does happen during initialization of alloc-track when the
backing hd is set and the refresh_limits driver callback is invoked.
The bdrv_get_info() call to get the cluster size is moved to
directly after opening the file child in track_open().
The important thing is that at least the request alignment for the
write target is used, because then the RMW cycle in bdrv_pwritev
will gather enough data from the backing file. Partial cluster
allocations in the target are not a fundamental issue, because the
driver returns its allocation status based on the bitmap, so any
other data that maps to the same cluster will still be copied later
by a stream job (or during writes to that cluster).
* Replacing the node cannot be done in the
track_co_change_backing_file() callback, because it is a coroutine
and cannot hold the block graph lock exclusively. So it is moved to
the stream job itself with the auto-remove option not having an
effect anymore (qemu-server would always set it anyways).
In the future, there could either be a special option for the stream
job, or maybe the upcoming blockdev-replace QMP command can be used.
Replacing the backing child is actually already done in the stream
job, so no need to do it in the track_co_change_backing_file()
callback. It also cannot be called from a coroutine. Looking at the
implementation in the qcow2 driver, it doesn't seem to be intended
to change the backing child itself, just update driver-internal
state.
Other changes:
* alloc-track: Error out early when used without auto-remove. Since
replacing the node now happens in the stream job, where the option
cannot be read from (it's internal to the driver), it will always be
treated as 'on'. Makes sure to have users beside qemu-server notice
the change (should they even exist). The option can be fully dropped
in the future while adding a version guard in qemu-server.
* alloc-track: Avoid seemingly superfluous child permission update.
Doesn't seem necessary nowadays (maybe after commit "alloc-track:
fix deadlock during drop" where the dropping is not rescheduled and
delayed anymore or some upstream change). Replacing the block node
will already update the permissions of the new node (which was the
file child before). Should there really be some issue, instead of
having a drop state, this could also be just based off the fact
whether there is still a backing child.
Dumping the cumulative (shared) permissions for the BDS with a debug
print yields the same values after this patch and with QEMU 8.1,
namely 3 and 5.
* PBS block driver: compile unconditionally. Proxmox VE always needs
it and something in the build process changed to make it not enabled
by default. Probably would need to move the build option to meson
otherwise.
* backup: job unreferencing during cleanup needs to happen outside of
coroutine, so it was moved to before invoking the clean
* mirror: Cherry-pick stable fix to avoid potential deadlock.
* savevm-async: migrate_init now can fail, so propagate potential
error.
* savevm-async: compression counters are not accessible outside
migration/ram-compress now, so drop code that prophylactically set
it to zero.
[0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -456,7 +456,7 @@ static void ide_trim_bh_cb(void *opaque)
     iocb->bh = NULL;
     qemu_aio_unref(iocb);
 
-    /* Paired with an increment in ide_issue_trim() */
+    /* Paired with an increment in ide_issue_trim_cb() */
     blk_dec_in_flight(blk);
 }
 
@@ -516,6 +516,8 @@ static void ide_issue_trim_cb(void *opaque, int ret)
 done:
     iocb->aiocb = NULL;
     if (iocb->bh) {
+        /* Paired with a decrement in ide_trim_bh_cb() */
+        blk_inc_in_flight(s->blk);
         replay_bh_schedule_event(iocb->bh);
     }
 }
@@ -528,9 +530,6 @@ BlockAIOCB *ide_issue_trim(
     IDEDevice *dev = s->unit ? s->bus->slave : s->bus->master;
     TrimAIOCB *iocb;
 
-    /* Paired with a decrement in ide_trim_bh_cb() */
-    blk_inc_in_flight(s->blk);
-
     iocb = blk_aio_get(&trim_aiocb_info, s->blk, cb, cb_opaque);
     iocb->s = s;
     iocb->bh = qemu_bh_new_guarded(ide_trim_bh_cb, iocb,
@@ -754,8 +753,9 @@ void ide_cancel_dma_sync(IDEState *s)
      */
     if (s->bus->dma->aiocb) {
         trace_ide_cancel_dma_sync_remaining();
-        blk_drain(s->blk);
-        assert(s->bus->dma->aiocb == NULL);
+        while (s->bus->dma->aiocb) {
+            blk_drain(s->blk);
+        }
     }
 }
 