| de252f79 | 12-Nov-2025 |
Eric Blake <eblake@redhat.com> |
qio: Add QIONetListener API for using AioContext
The user calling himself "John Doe" reported a deadlock when attempting to use qemu-storage-daemon to serve both a base file over NBD, and a qcow2 file with that NBD export as its backing file, from the same process, even though it worked just fine when there were two q-s-d processes.

The bulk of the NBD server code properly uses coroutines to make progress in an event-driven manner, but the code for spawning a new coroutine at the point when listen(2) detects a new client was hard-coded to use the global GMainContext; in other words, the callback that triggers nbd_client_new to let the server start the negotiation sequence with the client requires the main loop to be making progress.

However, the code for bdrv_open of a qcow2 image with an NBD backing file uses an AIO_WAIT_WHILE nested event loop to ensure that the entire qcow2 backing chain is either fully loaded or rejected, without any side effects from the main loop causing unwanted changes to the disk being loaded (in short, an AioContext represents the set of actions that are known to be safe while handling block layer I/O, while excluding any other pending actions in the global main loop with potentially larger risk of unwanted side effects).
This creates a classic case of deadlock: the server can't progress to the point of accept(2)ing the client to write to the NBD socket because the main loop is being starved until the AIO_WAIT_WHILE completes the bdrv_open, but the AIO_WAIT_WHILE can't progress because it is blocked on the client coroutine stuck in a read() of the expected magic number from the server side of the socket.
This patch adds a new API to allow clients to opt in to listening via an AioContext rather than a GMainContext. This will allow NBD to fix the deadlock by performing all actions during bdrv_open in the main loop AioContext.
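As a rough sketch of what opting in might look like (the function name qio_net_listener_set_client_func_aio, its exact parameters, and the nbd_example_listen caller below are illustrative assumptions, not the actual interface added by this patch):

    #include "io/net-listener.h"
    #include "block/aio.h"

    /* Hypothetical opt-in: watch the listening sockets from an AioContext
     * instead of a GMainContext, so accept(2) can make progress even while
     * a nested AIO_WAIT_WHILE loop is starving the global main loop. */
    void qio_net_listener_set_client_func_aio(QIONetListener *listener,
                                              AioContext *ctx,
                                              QIONetListenerClientFunc func,
                                              gpointer data,
                                              GDestroyNotify notify);

    /* Illustrative caller: an NBD-style server listening from the main
     * loop AioContext. */
    static void nbd_example_listen(QIONetListener *listener,
                                   QIONetListenerClientFunc on_client)
    {
        AioContext *ctx = qemu_get_aio_context();  /* main loop AioContext */

        qio_net_listener_set_client_func_aio(listener, ctx,
                                             on_client, NULL, NULL);
    }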
Technical debt warning: I would have loved to use a notify function with AioContext to guarantee that we don't finalize the listener via an object_unref while any callback is still running (the way GSource does), but wiring notify functions into AioContext is a bigger task that will be deferred to a later QEMU release. For solving the NBD deadlock, it is sufficient to note that the QMP commands for enabling and disabling the NBD server are really the only points where we want to change the listener's callback. Furthermore, those commands are serviced in the main loop, which is the same AioContext that is listening for connections. Since a thread cannot interrupt itself, we are assured that at the point where we change the watch, no callbacks are active. This is NOT as powerful as the GSource cross-thread safety, but sufficient for today's needs.
An upcoming patch will then add a unit test (kept separate to make it easier to rearrange the series to demonstrate the deadlock without this patch).
Fixes: https://gitlab.com/qemu-project/qemu/-/issues/3169 Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20251113011625.878876-26-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
|
| cc0faf82 | 12-Nov-2025 |
Eric Blake <eblake@redhat.com> |
qio: Prepare NetListener to use AioContext
For ease of review, this patch adds an AioContext pointer to the QIONetListener struct, the code to trace it, and refactors listener->io_source to instead be an array of utility structs; but the aio_context pointer is always NULL until the next patch adds an API to set it. There should be no semantic change in this patch.
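A minimal sketch of the kind of layout this refactoring points toward (the struct and field names here are assumptions for illustration, not necessarily those used in the patch):

    /* Per-socket helper struct: bundles everything a watch callback needs,
     * with the same lifetime as the listener itself, since AioContext
     * handlers have no GDestroyNotify equivalent. */
    typedef struct QIONetListenerWatch {
        QIONetListener *listener;   /* back pointer, same lifetime */
        QIOChannelSocket *sioc;     /* the listening socket */
        GSource *io_source;         /* non-NULL while watched via GMainContext */
    } QIONetListenerWatch;

    struct QIONetListener {
        Object parent;
        size_t nsioc;
        QIOChannelSocket **sioc;       /* listening sockets, as before */
        QIONetListenerWatch *io_watch; /* replaces the old io_source array */
        AioContext *aio_context;       /* stays NULL until the next patch */
        /* ... callback function, opaque data, destroy notify, ... */
    };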
Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Message-ID: <20251113011625.878876-25-eblake@redhat.com>
|
| ec59a65a | 12-Nov-2025 |
Eric Blake <eblake@redhat.com> |
qio: Provide accessor around QIONetListener->sioc
An upcoming patch needs to pass more than just sioc as the opaque pointer to an AioContext; but since our AioContext code in general (and its QIO Channel wrapper code) lacks the notify callback present with GSource, we do not have the trivial option of just g_malloc'ing a small struct to hold all that data coupled with a notify of g_free. Instead, the data pointer must outlive the registered handler; in fact, giving the data pointer the same lifetime as the QIONetListener is adequate.
But the cleanest way to stick such a helper struct in QIONetListener will be to rearrange internal struct members. And that in turn means that all existing code that currently directly accesses listener->nsioc and listener->sioc[] should instead go through accessor functions, to be immune to the upcoming struct layout changes. So this patch adds accessor methods qio_net_listener_nsioc() and qio_net_listener_sioc(), and puts them to use.
While at it, notice that the pattern of grabbing an sioc from the listener only to turn around and call qio_channel_socket_get_local_address is common enough to also warrant a qio_net_listener_get_local_address helper, and fix a copy-paste error in the corresponding documentation.
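A sketch of what these accessors might look like (assuming internal members named nsioc and sioc; the real prototypes may differ in detail):

    #include "io/net-listener.h"
    #include "io/channel-socket.h"

    /* Number of listening sockets currently owned by the listener. */
    size_t qio_net_listener_nsioc(QIONetListener *listener)
    {
        return listener->nsioc;
    }

    /* The i-th listening socket, without exposing the struct layout. */
    QIOChannelSocket *qio_net_listener_sioc(QIONetListener *listener, size_t i)
    {
        return listener->sioc[i];
    }

    /* Convenience wrapper for the common "grab an sioc just to ask for its
     * local address" pattern described above. */
    SocketAddress *qio_net_listener_get_local_address(QIONetListener *listener,
                                                      size_t i, Error **errp)
    {
        return qio_channel_socket_get_local_address(
            qio_net_listener_sioc(listener, i), errp);
    }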
Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20251113011625.878876-24-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
|
| 9d861818 | 12-Nov-2025 |
Eric Blake <eblake@redhat.com> |
qio: Protect NetListener callback with mutex
Without a mutex, NetListener can run into this data race between a thread changing the async callback function to use when a client connects, and the thread servicing polling of the listening sockets:
Thread 1: qio_net_listener_set_client_func(lstnr, f1, ...);
  => foreach sock: socket
     => object_ref(lstnr)
     => sock_src = qio_channel_socket_add_watch_source(sock, ...., lstnr, object_unref);

Thread 2: poll()
  => event POLLIN on socket
     => ref(GSourceCallback)
     => if (lstnr->io_func)   // while lstnr->io_func is f1
...interrupt..

Thread 1: qio_net_listener_set_client_func(lstnr, f2, ...);
  => foreach sock: socket
     => g_source_unref(sock_src)
  => foreach sock: socket
     => object_ref(lstnr)
     => sock_src = qio_channel_socket_add_watch_source(sock, ...., lstnr, object_unref);

Thread 2:
     => call lstnr->io_func(lstnr->io_data)   // now sees f2
     => return dispatch(sock)
     => unref(GSourceCallback)
        => destroy-notify
           => object_unref
Found by inspection; I did not spend time adding sleeps or running under gdb to actually trigger the race in practice. This is a SEGFAULT waiting to happen if f2 can become NULL because thread 1 deregisters the user's callback while thread 2 is trying to service the callback. Other messes are also theoretically possible, such as running callback f1 with an opaque pointer that should only be passed to f2 (if the client code were to use more than just a binary choice between a single async function or NULL).
Mitigating factor: if the code that modifies the QIONetListener can only be reached by the same thread that is executing the polling and async callbacks, then we do not hit the two-thread race documented above (even though poll can see two clients trying to connect in the same window of time, any changes made to the listener by the first async callback will be completed before the thread moves on to the second client). However, QEMU is complex enough that this is hard to analyze generically. If QMP commands (like nbd-server-stop) are run in the main loop and the listener uses the main loop, things should be okay. But when a client uses an alternative GMainContext, or if servicing a QMP command hands off to a coroutine to avoid blocking, I am unable to state with certainty whether a given net listener can be modified by a thread different from the polling thread running callbacks.
At any rate, it is worth making the API robust. To ensure that a NetListener can be safely modified from any thread, add a mutex that guarantees atomic access to all callback-related members of a listener object. This problem has been present since QIONetListener was introduced.
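A minimal sketch of the locking pattern described (QemuMutex is the standard QEMU primitive; the listener field names below are assumptions):

    #include "qemu/osdep.h"
    #include "qemu/thread.h"
    #include "io/net-listener.h"

    /* All callback-related members are only touched under listener->lock. */
    static void listener_set_func_locked(QIONetListener *listener,
                                         QIONetListenerClientFunc func,
                                         gpointer data, GDestroyNotify notify)
    {
        qemu_mutex_lock(&listener->lock);
        if (listener->io_notify) {
            listener->io_notify(listener->io_data);
        }
        listener->io_func = func;
        listener->io_data = data;
        listener->io_notify = notify;
        qemu_mutex_unlock(&listener->lock);
    }

    /* The poll-side dispatcher samples func and data as one atomic pair,
     * so it can never call f1 with f2's opaque data, nor call a pointer
     * that was set to NULL halfway through the check. */
    static void listener_dispatch(QIONetListener *listener,
                                  QIOChannelSocket *client)
    {
        QIONetListenerClientFunc func;
        gpointer data;

        qemu_mutex_lock(&listener->lock);
        func = listener->io_func;
        data = listener->io_data;
        qemu_mutex_unlock(&listener->lock);

        if (func) {
            func(listener, client, data);
        }
    }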
Note that this does NOT prevent the case of a second round of the user's old async callback being invoked with the old opaque data, even when the user has already tried to change the async callback during the first async callback; it is only about ensuring that there is no sharding (the eventual io_func(io_data) call that does get made will correspond to a particular combination that the user had requested at some point in time, and not be sharded to a combination that never existed in practice). In other words, this patch maintains the status quo that a user's async callback function already needs to be robust to parallel clients landing in the same window of poll servicing, even when only one client is desired, if that particular listener can be amended in a thread other than the one doing the polling.
CC: qemu-stable@nongnu.org Fixes: 53047392 ("io: introduce a network socket listener API", v2.12.0) Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20251113011625.878876-20-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> [eblake: minor commit message wording improvements] Signed-off-by: Eric Blake <eblake@redhat.com>
|
| b5676493 | 12-Nov-2025 |
Eric Blake <eblake@redhat.com> |
qio: Remember context of qio_net_listener_set_client_func_full
io/net-listener.c has two modes of use: asynchronous (the user calls qio_net_listener_set_client_func to wake up the callback via the global GMainContext, or qio_net_listener_set_client_func_full to wake up the callback via the caller's own alternative GMainContext), and synchronous (the user calls qio_net_listener_wait_client which creates its own GMainContext and waits for the first client connection before returning, with no need for a user's callback).

But commit 938c8b79 has a latent logic flaw: when qio_net_listener_wait_client finishes on its temporary context, it reverts all of the siocs back to the global GMainContext rather than the potentially non-NULL context they might have been originally registered with.

Similarly, if the user creates a net-listener, adds initial addresses, registers an async callback with a non-default context (which ties to all siocs for the initial addresses), then adds more addresses with qio_net_listener_add, the siocs for later addresses are blindly placed in the global context, rather than sharing the context of the earlier ones.
In practice, I don't think this has caused issues. As pointed out by the original commit, all async callers prior to that commit were already okay with the NULL default context; and the typical usage pattern is to first add ALL the addresses the listener will pay attention to before ever setting the async callback. Likewise, if a file uses only qio_net_listener_set_client_func instead of qio_net_listener_set_client_func_full, then it is never using a custom context, so later assignments of async callbacks will still be to the same global context as earlier ones. Meanwhile, any callers that want to do the sync operation to grab the first client are unlikely to register an async callback; altogether bypassing the question of whether later assignments of a GSource are being tied to a different context over time.
I do note that chardev/char-socket.c is the only file that calls both qio_net_listener_wait_client (sync for a single client in tcp_chr_accept_server_sync), and qio_net_listener_set_client_func_full (several places, all with chr->gcontext, but sometimes with a NULL callback function during teardown). But as far as I can tell, the two uses are mutually exclusive, based on the is_waitconnect parameter to qmp_chardev_open_socket_server.
That said, it is more robust to remember when an async callback function is tied to a non-default context, and have both the sync wait and any late address additions honor that same context. That way, the code will be robust even if a later user performs a sync wait for a specific client in the middle of servicing a longer-lived QIONetListener that has an async callback for all other clients.
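In outline, the fix amounts to remembering the context at registration time and consulting it everywhere else (a sketch; the listener->context member is an assumption):

    /* Remember the caller's context when the async callback is registered. */
    void qio_net_listener_set_client_func_full(QIONetListener *listener,
                                               QIONetListenerClientFunc func,
                                               gpointer data,
                                               GDestroyNotify notify,
                                               GMainContext *context)
    {
        listener->context = context;  /* NULL still means the global context */
        /* ... re-attach each sioc's GSource to @context, as before ... */
    }

    /* Both qio_net_listener_add() for late addresses and the cleanup path
     * of qio_net_listener_wait_client() then reuse listener->context
     * instead of blindly reverting to the global GMainContext. */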
CC: qemu-stable@nongnu.org Fixes: 938c8b79 ("qio: store gsources for net listeners", v2.12.0) Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Message-ID: <20251113011625.878876-19-eblake@redhat.com>
|
| 84005f4a | 24-Oct-2025 |
Manish Mishra <manish.mishra@nutanix.com> |
io: flush zerocopy socket error queue on sendmsg failure due to ENOBUF
The kernel allocates extra metadata SKBs in case of a zerocopy send, eventually used for zerocopy's notification mechanism. This metadata memory is accounted for in the OPTMEM limit. The kernel queues completion notifications on the socket error queue and this error queue is freed when userspace reads it.
Usually, in the case of in-order processing, the kernel will batch the notifications and merge the metadata into a single SKB and free the rest. As a result, it never exceeds the OPTMEM limit. However, if there is any out-of-order processing or intermittent zerocopy failures, this error chain can grow significantly, exhausting the OPTMEM limit. As a result, all new sendmsg requests fail to allocate any new SKB, leading to an ENOBUF error. Depending on the amount of data queued before the flush (i.e., large live migration iterations), even large OPTMEM limits are prone to failure.
To work around this, if we encounter an ENOBUF error with a zerocopy sendmsg, flush the error queue and retry once more.
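In outline, the send path then looks something like this sketch (the qio_channel_socket_flush_error_queue() helper name is an assumption; QEMU's zerocopy completion handling lives in io/channel-socket.c):

    #include <errno.h>
    #include <stdbool.h>
    #include <sys/socket.h>
    #include "io/channel-socket.h"

    static ssize_t zerocopy_sendmsg(QIOChannelSocket *sioc,
                                    struct msghdr *msg, int flags)
    {
        bool retried = false;
        ssize_t ret;

    retry:
        ret = sendmsg(sioc->fd, msg, flags);
        if (ret < 0 && errno == ENOBUFS &&
            (flags & MSG_ZEROCOPY) && !retried) {
            /* OPTMEM is exhausted by pending completion notifications:
             * drain the socket error queue to free the metadata SKBs,
             * then retry the send exactly once. */
            retried = true;
            qio_channel_socket_flush_error_queue(sioc);  /* assumed helper */
            goto retry;
        }
        return ret;
    }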
Co-authored-by: Manish Mishra <manish.mishra@nutanix.com> Signed-off-by: Tejus GK <tejus.gk@nutanix.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> [DB: change TRUE/FALSE to true/false for 'bool' type; add more #ifdef QEMU_MSG_ZEROCOPY blocks] Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
|
| d14c8cc6 | 16-Sep-2025 |
Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> |
io/channel-socket: rework qio_channel_socket_copy_fds()
We want to switch from qemu_socket_set_block() to the newer qemu_set_blocking(), which provides a return status for the operation, so that errors can be handled.
Still, we want to keep the qio_channel_socket_readv() interface clean: currently it allocates @fds only on success.
So, in case of error, we should close all incoming fds and leave the user's @fds untouched (or zeroed).
Let's add separate functions qio_channel_handle_fds() and qio_channel_cleanup_fds() to achieve this.
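In outline (a sketch; the qemu_set_blocking() signature shown here, and the helper bodies, are assumptions based on the description above):

    /* Make the received fds usable: blocking and close-on-exec.  Returns
     * false (reporting errp) if any fd cannot be switched to blocking. */
    static bool qio_channel_handle_fds(int *fds, size_t nfds, Error **errp)
    {
        for (size_t i = 0; i < nfds; i++) {
            if (!qemu_set_blocking(fds[i], true, errp)) {
                return false;
            }
            qemu_set_cloexec(fds[i]);
        }
        return true;
    }

    /* Error path: close everything we received so the caller's @fds and
     * @nfds can stay untouched (or zero) on failure. */
    static void qio_channel_cleanup_fds(int *fds, size_t nfds)
    {
        for (size_t i = 0; i < nfds; i++) {
            close(fds[i]);
        }
        g_free(fds);
    }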
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
|
| 1ed89039 | 16-Sep-2025 |
Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> |
treewide: handle result of qio_channel_set_blocking()
Currently, we always pass NULL as the errp argument. That doesn't look good.
Some implementations of the interface may actually report errors. The channel-socket implementation currently either ignores errors or crashes on them, but we are going to straighten it out in further commits so that it always reports an errp.
So, convert all callers to either handle the error (where the environment allows) or explicitly use &error_abort.
Also take the chance to change the return value to a more convenient bool (keeping in mind that underlying implementations may return -1 on failure, not -errno).
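Typical caller conversions then look roughly like this (a sketch; which form is used depends on what each call site can do):

    /* Where the environment allows propagating the error: */
    if (!qio_channel_set_blocking(ioc, false, errp)) {
        return false;
    }

    /* Where failure is not expected and cannot be handled gracefully: */
    qio_channel_set_blocking(ioc, false, &error_abort);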
Suggested-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> [DB: fix return type mismatch in TLS/websocket channel impls for qio_channel_set_blocking] Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
|
| d343f395 | 10-Sep-2025 |
Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> |
io/channel: document how qio_channel_readv_full() handles fds
The only implementation that may receive incoming fds is qio_channel_socket_readv() (in io/channel-socket.c). qio_channel_socket_readv() does call (through qio_channel_socket_copy_fds()) qemu_socket_set_block() and qemu_set_cloexec() for each fd.
Also, qio_channel_socket_copy_fds() is called at the end of qio_channel_socket_readv(), on the success path.
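In other words, the per-fd handling on the success path amounts to something like this sketch:

    for (size_t i = 0; i < nfds; i++) {
        qemu_socket_set_block(fds[i]);  /* force the received fd to blocking mode */
        qemu_set_cloexec(fds[i]);       /* and mark it close-on-exec */
    }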
Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
|
| 322d873b | 07-Feb-2025 |
Fabiano Rosas <farosas@suse.de> |
io: Add a read flag for relaxed EOF
Add a read flag that can inform a channel that it's ok to receive an EOF at any moment. Channels that have some form of strict EOF tracking, such as TLS session termination, may choose to ignore EOF errors with the use of this flag.
This is being added for compatibility with older migration streams that do not include a TLS termination step.
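Conceptually the flag is consumed on the TLS read path like this sketch (the flag identifier QIO_CHANNEL_READ_FLAG_RELAXED_EOF is my assumption of the name):

    if (ret == GNUTLS_E_PREMATURE_TERMINATION) {
        if (flags & QIO_CHANNEL_READ_FLAG_RELAXED_EOF) {
            /* Caller opted in: a missing TLS close_notify is not an error,
             * just end-of-stream. */
            return 0;
        }
        error_setg(errp, "TLS session was terminated prematurely");
        return -1;
    }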
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>
|
| f1cfe394 | 29-Feb-2024 |
Nikolay Borisov <nborisov@suse.com> |
io: Add generic pwritev/preadv interface
Introduce basic pwritev/preadv support in the generic channel layer. A specific implementation will follow for the file channel, as this is required in order to support migration streams with a fixed location for each RAM page.
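The generic-layer part is essentially a thin dispatcher of this shape (a sketch; the exact prototypes and the io_pwritev class hook name may differ):

    #include <sys/uio.h>
    #include "io/channel.h"

    ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
                                size_t niov, off_t offset, Error **errp)
    {
        QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);

        if (!klass->io_pwritev) {
            error_setg(errp, "Channel does not support pwritev");
            return -1;
        }
        return klass->io_pwritev(ioc, iov, niov, offset, errp);
    }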
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: "Daniel P. Berrangé" <berrange@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20240229153017.2221-4-farosas@suse.de Signed-off-by: Peter Xu <peterx@redhat.com>
|