mirror of https://gitlab.com/qemu-project/qemu
A user going by "John Doe" reported a deadlock when attempting to use qemu-storage-daemon to serve both a base file over NBD, and a qcow2 file with that NBD export as its backing file, from the same process, even though the same setup worked fine with two qemu-storage-daemon processes.

The bulk of the NBD server code properly uses coroutines to make progress in an event-driven manner, but the code for spawning a new coroutine when listen(2) detects a new client was hard-coded to use the global GMainContext; in other words, the callback that triggers nbd_client_new to let the server start the negotiation sequence with the client requires the main loop to be making progress. However, bdrv_open of a qcow2 image with an NBD backing file uses an AIO_WAIT_WHILE nested event loop to ensure that the entire qcow2 backing chain is either fully loaded or rejected, without any side effects from the main loop causing unwanted changes to the disk being loaded. (In short, an AioContext represents the set of actions known to be safe while handling block layer I/O, while excluding any other pending actions in the global main loop that carry a larger risk of unwanted side effects.) This creates a classic deadlock: the server can't progress to the point of accept(2)ing the client and writing to the NBD socket, because the main loop is starved until AIO_WAIT_WHILE completes the bdrv_open; but AIO_WAIT_WHILE can't progress, because it is blocked on the client coroutine stuck in a read() of the expected magic number from the server side of the socket.

This patch adds a new API to allow clients to opt in to listening via an AioContext rather than a GMainContext. This will allow NBD to fix the deadlock by performing all actions during bdrv_open in the main loop AioContext.
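The deadlock shape described above can be modeled without any QEMU code. The sketch below is a toy simulation, not QEMU's actual event loop: `Context`, `demo`, and the callback names are all illustrative stand-ins for GMainContext, the main-loop AioContext, and nbd_client_new. The nested loop services only the aio queue, mimicking AIO_WAIT_WHILE starving the global main loop.

```python
from collections import deque

class Context:
    """Toy event-dispatch queue; stands in for GMainContext/AioContext."""
    def __init__(self):
        self.pending = deque()

    def schedule(self, fn):
        self.pending.append(fn)

    def run_one(self):
        # Dispatch at most one pending callback.
        if self.pending:
            self.pending.popleft()()

def demo(listen_in_aio_context):
    main_ctx = Context()   # the global GMainContext
    aio_ctx = Context()    # the main-loop AioContext serviced by AIO_WAIT_WHILE

    state = {"negotiated": False}

    # The listener fires its client callback in whichever context it was
    # registered with; the callback stands in for nbd_client_new() kicking
    # off negotiation.
    def client_cb():
        state["negotiated"] = True

    target = aio_ctx if listen_in_aio_context else main_ctx
    target.schedule(client_cb)

    # AIO_WAIT_WHILE(bdrv_open): a nested loop that services only aio_ctx,
    # starving main_ctx. Bounded here so the deadlocked case terminates.
    for _ in range(10):
        if state["negotiated"]:
            break
        aio_ctx.run_one()
    return state["negotiated"]
```

With the callback registered in the starved main context, negotiation never completes (`demo(False)` is the deadlock); registering it in the AioContext that the nested loop services (`demo(True)`) lets `bdrv_open` finish, which is the fix this patch enables.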
Technical debt warning: I would have loved to use a notify function with AioContext to guarantee that we don't finalize the listener via an object_unref while any callback is still running (the way GSource does), but wiring notify functions into AioContext is a bigger task that will be deferred to a later QEMU release. For solving the NBD deadlock, it is sufficient to note that the QMP commands for enabling and disabling the NBD server are really the only points where we want to change the listener's callback. Furthermore, those commands are serviced in the main loop, which is the same AioContext that is listening for connections. Since a thread cannot interrupt itself, we are assured that at the point where we change the watch, no callbacks are active. This is NOT as powerful as GSource's cross-thread safety, but sufficient for today's needs.

An upcoming patch will add a unit test (kept separate to make it easier to rearrange the series to demonstrate the deadlock without this patch).

Fixes: https://gitlab.com/qemu-project/qemu/-/issues/3169
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251113011625.878876-26-eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
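The "a thread cannot interrupt itself" argument can also be sketched as a toy model. Everything below is illustrative: the dispatch queue stands in for the main-loop AioContext, and `qmp_nbd_server_stop` is a hypothetical name for the QMP handler that changes the watch. Because the handler and the client callbacks are dispatched one at a time from the same single-threaded context, the handler can observe that no client callback is mid-flight when it runs.

```python
from collections import deque

def demo():
    # Single-threaded dispatch queue standing in for the main-loop AioContext.
    queue = deque()
    active_client_callbacks = [0]
    observed = []

    def client_cb():
        active_client_callbacks[0] += 1
        # ...negotiation work would happen here...
        active_client_callbacks[0] -= 1

    def qmp_nbd_server_stop():
        # Runs in the same context that dispatches client callbacks, so no
        # client callback can be active at this instant: safe to change the
        # watch without GSource-style notify protection.
        observed.append(active_client_callbacks[0])

    queue.extend([client_cb, qmp_nbd_server_stop, client_cb])
    while queue:
        queue.popleft()()   # one callback at a time; never reentrant
    return observed
```

`demo()` records `[0]`: whenever the stop handler runs, the count of in-flight client callbacks is zero. This serialization is exactly what substitutes for cross-thread notify safety here, and it would no longer hold if the watch were changed from a different thread.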