You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
qemu/block
Hanna Czenczek b002acacc1 Revert "nvme: Fix coroutine waking"
This reverts commit 0f142cbd91.

Said commit changed the replay_bh_schedule_oneshot_event() in
nvme_rw_cb() to aio_co_wake(), allowing the request coroutine to be
entered directly (instead of only being scheduled for later execution).
This can cause the device to become stalled like so:

It is possible that after completion the request coroutine goes on to
submit another request without yielding, e.g. a flush after a write to
emulate FUA.  This will likely cause a nested nvme_process_completion()
call because nvme_rw_cb() itself is called from there.

(After submitting a request, we invoke nvme_process_completion() through
defer_call(); but the fact that nvme_process_completion() ran in the
first place indicates that we are not in a call-deferring section, so
defer_call() will call nvme_process_completion() immediately.)

If this inner nvme_process_completion() loop then processes any
completions, it will write the final completion queue (CQ) head index to
the CQ head doorbell, and subsequently execution will return to the
outer nvme_process_completion() loop.  Even if this loop now finds no
further completions, it still processed at least one completion before,
or it would not have called the nvme_rw_cb() which led to nesting.
Therefore, it will now write the exact same CQ head index value to the
doorbell, which effectively is an unrecoverable error[1].

Therefore, nesting of nvme_process_completion() does not work at this
point.  Reverting said commit removes the nesting (by scheduling the
request coroutine instead of entering it immediately), and so fixes the
stall.

On the downside, reverting said commit breaks multiqueue for nvme, but
better to have single-queue working than neither.  For 11.0, we will
have a solution that makes both work.

A side note: There is a comment in nvme_process_completion() above
qemu_bh_schedule() that claims nesting works, as long as it is done
through the completion_bh.  I am quite sure that is not true, for two
reasons:
- The problem described above, which is even worse when going through
  nvme_process_completion_bh() because that function unconditionally
  writes to the CQ head doorbell,
- nvme_process_completion_bh() never takes q->lock, so
  nvme_process_completion() unlocking it will likely abort.

Given the lack of reports of such aborts, I believe that completion_bh
simply is unused in practice.

[1] See the NVMe Base Specification revision 2.3, page 180, figure 152:
    “Invalid Doorbell Write Value: A host attempted to write an invalid
    doorbell value. Some possible causes of this error are: [...] the
    value written is the same as the previously written doorbell value.”

    To even be notified of this error, we would need to send an
    Asynchronous Event Request to the admin queue (p. 178ff), which we
    don’t do, and then to handle it, we would need to delete and
    recreate the queue (p. 88, section 3.3.1.2 Queue Usage).

Cc: qemu-stable@nongnu.org
Reported-by: Lukáš Doktor <ldoktor@redhat.com>
Tested-by: Lukáš Doktor <ldoktor@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-id: 20251215141540.88915-1-hreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
5 days ago
..
export block/export: Add option to allow export of inactive nodes 11 months ago
monitor block/monitor: Use hmp_handle_error to report error 2 months ago
accounting.c block: enable stats-intervals for storage devices 2 months ago
aio_task.c block: Remove unused aio_task_pool_empty 1 year ago
amend.c block: Mark BlockDriver callbacks for amend job GRAPH_RDLOCK 3 years ago
backup.c block: add bdrv_graph_wrlock_drained() convenience wrapper 5 months ago
blkdebug.c block: Expand block status mode from bool to flags 7 months ago
blkio.c include/system: Move exec/memory.h to system/memory.h 8 months ago
blklogwrites.c block: add bdrv_graph_wrlock_drained() convenience wrapper 5 months ago
blkreplay.c blkreplay: Run BH in coroutine’s AioContext 1 month ago
blkverify.c block: add bdrv_graph_wrlock_drained() convenience wrapper 5 months ago
block-backend.c block: use pwrite_zeroes_alignment when writing first sector 4 weeks ago
block-copy.c include: Rename sysemu/ -> system/ 1 year ago
block-gen.h block-coroutine-wrapper.py: support also basic return types 3 years ago
block-ram-registrar.c include: Rename sysemu/ -> system/ 1 year ago
bochs.c block: replace TABs with space 1 month ago
cloop.c block: Take graph lock for most of .bdrv_open 2 years ago
commit.c block/commit: mark commit_abort() as GRAPH_UNLOCKED 5 months ago
copy-before-write.c block: Expand block status mode from bool to flags 7 months ago
copy-before-write.h blockdev-backup: Add error handling option for copy-before-write jobs 7 months ago
copy-on-read.c qapi: Move include/qapi/qmp/ to include/qobject/ 10 months ago
copy-on-read.h block: Mark bdrv_(un)freeze_backing_chain() and callers GRAPH_RDLOCK 2 years ago
coroutines.h block: Expand block status mode from bool to flags 7 months ago
create.c qemu/compiler: Absorb 'clang-tsa.h' 10 months ago
crypto.c block: Allow drivers to control protocol prefix at creation 1 month ago
crypto.h block: Support detached LUKS header creation using qemu-img 2 years ago
curl.c curl: Fix coroutine waking 1 month ago
dirty-bitmap.c block: Mark bdrv_*_dirty_bitmap() and callers GRAPH_RDLOCK 3 years ago
dmg-bz2.c Include qemu-common.h exactly where needed 7 years ago
dmg-lzfse.c block/dmg: Ignore C99 prototype declaration mismatch from <lzfse.h> 3 years ago
dmg.c block: Protect bs->file with graph_lock 2 years ago
dmg.h block/dmg: Declare a type definition for DMG uncompress function 3 years ago
file-posix.c file-posix: Handle suspended dm-multipath better for SG_IO 2 weeks ago
file-win32.c block: replace TABs with space 1 month ago
filter-compress.c block: Take graph lock for most of .bdrv_open 2 years ago
gluster.c gluster: Do not move coroutine into BDS context 1 month ago
graph-lock.c block: add bdrv_graph_wrlock_drained() convenience wrapper 5 months ago
io.c block/io: Take reqs_lock for tracked_requests 1 month ago
io_uring.c block/io_uring: use non-vectored read/write when possible 1 month ago
iscsi-opts.c modules: add block module annotations 5 years ago
iscsi.c iscsi: Create AIO BH in original AioContext 1 month ago
linux-aio.c block: skip automatic zero-init of large array in ioq_submit 6 months ago
meson.build include: Rename sysemu/ -> system/ 1 year ago
mirror.c block: drop wrapper for bdrv_set_backing_hd_drained() 5 months ago
nbd.c treewide: handle result of qio_channel_set_blocking() 3 months ago
nfs.c nfs: Run co BH CB in the coroutine’s AioContext 1 month ago
null.c null-aio: Run CB in original AioContext 1 month ago
nvme.c Revert "nvme: Fix coroutine waking" 5 days ago
parallels-ext.c qapi/crypto: Rename QCryptoHashAlgorithm to *Algo, and drop prefix 1 year ago
parallels.c block: Allow drivers to control protocol prefix at creation 1 month ago
parallels.h block: Protect bs->file with graph_lock 2 years ago
preallocate.c block: Protect bs->file with graph_lock 2 years ago
progress_meter.c coroutine: Clean up superfluous inclusion of qemu/lockable.h 3 years ago
qapi-system.c qapi: Move include/qapi/qmp/ to include/qobject/ 10 months ago
qapi.c qemu-img info: Optionally show block limits 2 months ago
qcow.c block: Allow drivers to control protocol prefix at creation 1 month ago
qcow2-bitmap.c block/qcow2-bitmap: Replace g_memdup() by g_memdup2() 2 years ago
qcow2-cache.c qcow2: Mark qcow2_signal_corruption() and callers GRAPH_RDLOCK 2 years ago
qcow2-cluster.c qcow2: put discards in discard queue when discard-no-unref is enabled 1 month ago
qcow2-refcount.c qcow2: put discards in discard queue when discard-no-unref is enabled 1 month ago
qcow2-snapshot.c include: Rename sysemu/ -> system/ 1 year ago
qcow2-threads.c thread-pool: avoid passing the pool parameter every time 3 years ago
qcow2.c qcow2: Schedule cache-clean-timer in realtime 1 month ago
qcow2.h qcow2: Fix cache_clean_timer 1 month ago
qed-check.c qed: mark more functions as coroutine_fns and GRAPH_RDLOCK 3 years ago
qed-cluster.c qed: protect table cache with CoMutex 9 years ago
qed-l2-cache.c osdep: Move memalign-related functions to their own header 4 years ago
qed-table.c block: use bdrv_co_debug_event in coroutine context 3 years ago
qed.c block: Allow drivers to control protocol prefix at creation 1 month ago
qed.h block: Protect bs->file with graph_lock 2 years ago
quorum.c block: add bdrv_graph_wrlock_drained() convenience wrapper 5 months ago
raw-format.c block: Allow drivers to control protocol prefix at creation 1 month ago
rbd.c rbd: Run co BH CB in the coroutine’s AioContext 1 month ago
replication.c block: mark bdrv_reopen_queue() and bdrv_reopen_multiple() as GRAPH_UNLOCKED 5 months ago
reqlist.c block/reqlist: allow adding overlapping requests 1 year ago
snapshot-access.c block: Expand block status mode from bool to flags 7 months ago
snapshot.c block: add bdrv_graph_wrlock_drained() convenience wrapper 5 months ago
ssh.c ssh: Run restart_coroutine in current AioContext 1 month ago
stream.c block/stream: mark stream_prepare() as GRAPH_UNLOCKED 5 months ago
throttle-groups.c qom: Make InterfaceInfo[] uses const 8 months ago
throttle.c block: Take graph lock for most of .bdrv_open 2 years ago
trace-events block/io_uring: use aio_add_sqe() 1 month ago
trace.h trace: switch position of headers to what Meson requires 5 years ago
vdi.c block: Allow drivers to control protocol prefix at creation 1 month ago
vhdx-endian.c Include qemu-common.h exactly where needed 7 years ago
vhdx-log.c vhdx: Take locks for accessing bs->file 2 years ago
vhdx.c block: Allow drivers to control protocol prefix at creation 1 month ago
vhdx.h vhdx: Take locks for accessing bs->file 2 years ago
vmdk.c Fix const qualifier build errors with recent glibc 2 weeks ago
vpc.c block: Allow drivers to control protocol prefix at creation 1 month ago
vvfat.c Fix const qualifier build errors with recent glibc 2 weeks ago
win32-aio.c win32-aio: Run CB in original context 1 month ago
write-threshold.c block: remove AioContext locking 2 years ago