Using Multiple ``IOThread``\ s
==============================

..
   Copyright (c) 2014-2017 Red Hat Inc.

   This work is licensed under the terms of the GNU GPL, version 2 or later.  See
   the COPYING file in the top-level directory.


This document explains the ``IOThread`` feature and how to write code that runs
outside the BQL.

The main loop and ``IOThread``\ s
---------------------------------
QEMU is an event-driven program that can do several things at once using an
event loop.  The VNC server and the QMP monitor are both processed from the
same event loop, which monitors their file descriptors until they become
readable and then invokes a callback.
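
As a rough illustration of that pattern, the following self-contained sketch
monitors file descriptors with ``poll(2)`` and invokes a callback when one
becomes readable.  It is a toy model, not QEMU's event loop: every ``Toy`` and
``toy_`` name is hypothetical and exists only for this example.

.. code-block:: c

   #include <assert.h>
   #include <poll.h>
   #include <stddef.h>
   #include <unistd.h>

   /* Toy model, NOT QEMU code: all Toy and toy_ names are hypothetical. */
   #define TOY_MAX_FDS 8

   typedef void ToyHandler(int fd, void *opaque);

   typedef struct {
       struct pollfd fds[TOY_MAX_FDS];
       ToyHandler *handlers[TOY_MAX_FDS];
       void *opaques[TOY_MAX_FDS];
       nfds_t nfds;
   } ToyLoop;

   /* Register a callback that runs when fd becomes readable. */
   static void toy_loop_add_reader(ToyLoop *loop, int fd,
                                   ToyHandler *cb, void *opaque)
   {
       assert(loop->nfds < TOY_MAX_FDS);
       loop->fds[loop->nfds].fd = fd;
       loop->fds[loop->nfds].events = POLLIN;
       loop->fds[loop->nfds].revents = 0;
       loop->handlers[loop->nfds] = cb;
       loop->opaques[loop->nfds] = opaque;
       loop->nfds++;
   }

   /*
    * Run one event loop iteration.  Returns the number of callbacks
    * invoked, or -1 if poll() failed.
    */
   static int toy_loop_run_once(ToyLoop *loop, int timeout_ms)
   {
       int dispatched = 0;
       nfds_t i;

       if (poll(loop->fds, loop->nfds, timeout_ms) < 0) {
           return -1;
       }
       for (i = 0; i < loop->nfds; i++) {
           if (loop->fds[i].revents & POLLIN) {
               loop->handlers[i](loop->fds[i].fd, loop->opaques[i]);
               dispatched++;
           }
       }
       return dispatched;
   }

   /* Example handler: consume one byte and count it in *opaque. */
   static void toy_count_byte(int fd, void *opaque)
   {
       char c;

       if (read(fd, &c, 1) == 1) {
           (*(int *)opaque)++;
       }
   }

For instance, after registering the read end of a pipe with
``toy_loop_add_reader()``, a call to ``toy_loop_run_once()`` dispatches
``toy_count_byte()`` only once a byte has been written to the other end.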

The default event loop is called the main loop (see ``main-loop.c``).  It is
possible to create additional event loop threads using
``-object iothread,id=my-iothread``.

Side note: The main loop and ``IOThread`` are both event loops but their code is
not shared completely.  Sometimes it is useful to remember that although they
are conceptually similar they are currently not interchangeable.

Why ``IOThread``\ s are useful
------------------------------
``IOThread``\ s allow the user to control the placement of work.  The main loop is a
scalability bottleneck on hosts with many CPUs.  Work can be spread across
several ``IOThread``\ s instead of just one main loop.  When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.

The main loop is also deeply associated with the BQL, which is a
scalability bottleneck in itself.  vCPU threads and the main loop use the BQL
to serialize execution of QEMU code.  This mutex is necessary because a lot of
QEMU's code historically was not thread-safe.

The fact that all I/O processing is done in a single main loop and that the
BQL is contended by all vCPU threads and the main loop explains
why it is desirable to place work into ``IOThread``\ s.

The experimental ``virtio-blk`` data-plane implementation has been benchmarked and
shows these effects:
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf

.. _how-to-program:

How to program for ``IOThread``\ s
----------------------------------
The main difference between legacy code and new code that can run in an
``IOThread`` is dealing explicitly with the event loop object, ``AioContext``
(see ``include/block/aio.h``).  Code that only works in the main loop
implicitly uses the main loop's ``AioContext``.  Code that supports running
in ``IOThread``\ s must be aware of its ``AioContext``.

``AioContext`` supports the following services:
 * File descriptor monitoring (read/write/error on POSIX hosts)
 * Event notifiers (inter-thread signalling)
 * Timers
 * Bottom Halves (BH) deferred callbacks

There are several old APIs that use the main loop ``AioContext``:
 * LEGACY ``qemu_aio_set_fd_handler()`` - monitor a file descriptor
 * LEGACY ``qemu_aio_set_event_notifier()`` - monitor an event notifier
 * LEGACY ``timer_new_ms()`` - create a timer
 * LEGACY ``qemu_bh_new()`` - create a BH
 * LEGACY ``qemu_bh_new_guarded()`` - create a BH with a device re-entrancy guard
 * LEGACY ``qemu_aio_wait()`` - run an event loop iteration

Since they implicitly work on the main loop they cannot be used in code that
runs in an ``IOThread``.  They might cause a crash or deadlock if called from an
``IOThread`` since the BQL is not held.

Instead, use the ``AioContext`` functions directly (see ``include/block/aio.h``):
 * ``aio_set_fd_handler()`` - monitor a file descriptor
 * ``aio_set_event_notifier()`` - monitor an event notifier
 * ``aio_timer_new()`` - create a timer
 * ``aio_bh_new()`` - create a BH
 * ``aio_bh_new_guarded()`` - create a BH with a device re-entrancy guard
 * ``aio_poll()`` - run an event loop iteration

The ``qemu_bh_new_guarded()``/``aio_bh_new_guarded()`` APIs accept a
``MemReentrancyGuard`` argument, which is used to check for and prevent
re-entrancy problems.  For BHs associated with devices, the re-entrancy guard
is contained in the corresponding ``DeviceState`` and named
``mem_reentrancy_guard``.

The ``AioContext`` can be obtained from the ``IOThread`` using
``iothread_get_aio_context()`` or for the main loop using
``qemu_get_aio_context()``.  Code that takes an ``AioContext`` argument
works in both ``IOThread``\ s and the main loop, depending on which
``AioContext`` instance the caller passes in.
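
The benefit of passing the event loop object explicitly can be sketched with a
small single-threaded model.  This is not QEMU code: ``ToyContext``,
``toy_bh_schedule()``, ``toy_poll()`` and the ``ToyDevice`` helpers are
hypothetical stand-ins that only mimic the idea of scheduling a deferred
callback in whichever context the caller chose.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Toy model, NOT QEMU code: all Toy and toy_ names are hypothetical. */
   #define TOY_MAX_BH 16

   typedef void ToyBHFunc(void *opaque);

   /* Stand-in for an event loop object owning a queue of deferred calls. */
   typedef struct {
       ToyBHFunc *cb[TOY_MAX_BH];
       void *opaque[TOY_MAX_BH];
       size_t pending;
   } ToyContext;

   /* Defer cb(opaque) until the next toy_poll() on this particular context. */
   static void toy_bh_schedule(ToyContext *ctx, ToyBHFunc *cb, void *opaque)
   {
       assert(ctx->pending < TOY_MAX_BH);
       ctx->cb[ctx->pending] = cb;
       ctx->opaque[ctx->pending] = opaque;
       ctx->pending++;
   }

   /* Run one iteration: call everything that was pending on entry. */
   static size_t toy_poll(ToyContext *ctx)
   {
       ToyBHFunc *cb[TOY_MAX_BH];
       void *opaque[TOY_MAX_BH];
       size_t n = ctx->pending;
       size_t i;

       for (i = 0; i < n; i++) {
           cb[i] = ctx->cb[i];
           opaque[i] = ctx->opaque[i];
       }
       ctx->pending = 0;   /* callbacks may schedule new work for later */
       for (i = 0; i < n; i++) {
           cb[i](opaque[i]);
       }
       return n;
   }

   /* A device that remembers the context it was given, assuming none. */
   typedef struct {
       ToyContext *ctx;
       int completed;
   } ToyDevice;

   static void toy_device_complete(void *opaque)
   {
       ToyDevice *dev = opaque;

       dev->completed++;
   }

   /* Works unchanged whichever context dev->ctx points to. */
   static void toy_device_kick(ToyDevice *dev)
   {
       toy_bh_schedule(dev->ctx, toy_device_complete, dev);
   }

With two ``ToyContext`` instances, one standing in for the main loop and one
for an ``IOThread``, ``toy_device_kick()`` needs no changes: the completion
callback simply runs when the context stored in the device is polled.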

How to synchronize with an ``IOThread``
---------------------------------------
Variables that can be accessed by multiple threads require some form of
synchronization such as ``qemu_mutex_lock()``, ``rcu_read_lock()``, etc.

``AioContext`` functions like ``aio_set_fd_handler()``,
``aio_set_event_notifier()``, ``aio_bh_new()``, and ``aio_timer_new()``
are thread-safe. They can be used to trigger activity in an ``IOThread``.

Side note: the best way to schedule a function call across threads is to call
``aio_bh_schedule_oneshot()``.
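
The general idea of handing a one-shot function call to another thread's event
loop can be sketched with plain POSIX threads.  This is a toy model under
stated assumptions, not a description of how ``aio_bh_schedule_oneshot()`` is
implemented: the ``ToyThreadLoop`` type and all ``toy_`` functions are
hypothetical, and a mutex plus condition variable stand in for the real
notification mechanism.

.. code-block:: c

   #include <assert.h>
   #include <pthread.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* Toy model, NOT QEMU code: all Toy and toy_ names are hypothetical. */
   #define TOY_QUEUE_LEN 32

   typedef void ToyFunc(void *opaque);

   typedef struct {
       pthread_mutex_t lock;            /* protects every field below */
       pthread_cond_t wake;
       ToyFunc *cb[TOY_QUEUE_LEN];
       void *opaque[TOY_QUEUE_LEN];
       size_t head, tail;               /* head == tail means empty */
       bool stopping;
   } ToyThreadLoop;

   static void toy_loop_init(ToyThreadLoop *l)
   {
       pthread_mutex_init(&l->lock, NULL);
       pthread_cond_init(&l->wake, NULL);
       l->head = l->tail = 0;
       l->stopping = false;
   }

   /* Thread-safe: any thread may queue cb(opaque) for the loop thread. */
   static void toy_schedule_oneshot(ToyThreadLoop *l, ToyFunc *cb, void *opaque)
   {
       pthread_mutex_lock(&l->lock);
       assert(l->tail - l->head < TOY_QUEUE_LEN);
       l->cb[l->tail % TOY_QUEUE_LEN] = cb;
       l->opaque[l->tail % TOY_QUEUE_LEN] = opaque;
       l->tail++;
       pthread_cond_signal(&l->wake);
       pthread_mutex_unlock(&l->lock);
   }

   /* Ask the loop thread to exit once its queue is empty. */
   static void toy_loop_stop(ToyThreadLoop *l)
   {
       pthread_mutex_lock(&l->lock);
       l->stopping = true;
       pthread_cond_signal(&l->wake);
       pthread_mutex_unlock(&l->lock);
   }

   /* Thread body: run queued callbacks until stopped and drained. */
   static void *toy_loop_run(void *arg)
   {
       ToyThreadLoop *l = arg;
       ToyFunc *cb;
       void *opaque;

       pthread_mutex_lock(&l->lock);
       for (;;) {
           while (l->head == l->tail && !l->stopping) {
               pthread_cond_wait(&l->wake, &l->lock);
           }
           if (l->head == l->tail) {
               break;                   /* stopping and nothing left to run */
           }
           cb = l->cb[l->head % TOY_QUEUE_LEN];
           opaque = l->opaque[l->head % TOY_QUEUE_LEN];
           l->head++;
           pthread_mutex_unlock(&l->lock);
           cb(opaque);                  /* run without holding the queue lock */
           pthread_mutex_lock(&l->lock);
       }
       pthread_mutex_unlock(&l->lock);
       return NULL;
   }

   /* Example callback: only ever runs in the loop thread. */
   static void toy_increment(void *opaque)
   {
       (*(int *)opaque)++;
   }

In this sketch the data touched by ``toy_increment()`` needs no lock of its
own, because it is only accessed from the loop thread until that thread has
been joined.  Only the queue itself is shared and therefore protected.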

The main loop thread can wait synchronously for a condition using
``AIO_WAIT_WHILE()``.

``AioContext`` and the block layer
----------------------------------
The ``AioContext`` originates from the QEMU block layer, even though nowadays
``AioContext`` is a generic event loop that can be used by any QEMU subsystem.

The block layer has support for ``AioContext`` integrated.  Each
``BlockDriverState`` is associated with an ``AioContext`` using
``bdrv_try_change_aio_context()`` and ``bdrv_get_aio_context()``.
This allows block layer code to process I/O inside the
right ``AioContext``.  Other subsystems may wish to follow a similar approach.

Block layer code must therefore expect to run in an ``IOThread`` and avoid using
old APIs that implicitly use the main loop.  See
`How to program for IOThreads`_ for information on how to do that.

Code running in the monitor typically needs to ensure that past
requests from the guest are completed.  When a block device is running
in an ``IOThread``, the ``IOThread`` can also process requests from the guest
(via ioeventfd).  To achieve both objectives, wrap the code between
``bdrv_drained_begin()`` and ``bdrv_drained_end()``, thus creating a "drained
section".
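
The effect of a drained section can be pictured with a small single-threaded
model.  This is only a sketch loosely inspired by the description above, not
the block layer's implementation: ``ToyDisk`` and the ``toy_`` functions are
hypothetical, and the loop in ``toy_drained_begin()`` merely stands in for
waiting until in-flight requests have finished.

.. code-block:: c

   #include <assert.h>

   /* Toy model, NOT QEMU code: all Toy and toy_ names are hypothetical. */
   typedef struct {
       int quiesce_counter;   /* > 0 while inside a drained section */
       int in_flight;         /* requests submitted but not yet completed */
       int queued;            /* requests held back while drained */
       int completed;
   } ToyDisk;

   /* Guest-side submission: held back while the disk is drained. */
   static void toy_submit(ToyDisk *d)
   {
       if (d->quiesce_counter > 0) {
           d->queued++;
           return;
       }
       d->in_flight++;
   }

   static void toy_complete_one(ToyDisk *d)
   {
       assert(d->in_flight > 0);
       d->in_flight--;
       d->completed++;
   }

   /* Stop new requests, then wait for past requests to complete. */
   static void toy_drained_begin(ToyDisk *d)
   {
       d->quiesce_counter++;
       while (d->in_flight > 0) {
           toy_complete_one(d);
       }
   }

   /* Leave the section; the outermost end resumes held-back requests. */
   static void toy_drained_end(ToyDisk *d)
   {
       assert(d->quiesce_counter > 0);
       if (--d->quiesce_counter == 0) {
           d->in_flight += d->queued;
           d->queued = 0;
       }
   }

Between ``toy_drained_begin()`` and ``toy_drained_end()`` the model guarantees
that ``in_flight`` is zero and stays zero, which is the property monitor code
relies on in a drained section.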

Long-running jobs (usually in the form of coroutines) are often scheduled in
the ``BlockDriverState``'s ``AioContext``.  The functions
``bdrv_add_aio_context_notifier()`` and ``bdrv_remove_aio_context_notifier()``,
or alternatively ``blk_add_aio_context_notifier()`` and
``blk_remove_aio_context_notifier()`` if you use a ``BlockBackend``,
can be used to get a notification whenever ``bdrv_try_change_aio_context()``
moves a ``BlockDriverState`` to a different ``AioContext``.
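
The notification pattern can be sketched as follows.  This is a toy model, not
the block layer API: ``ToyNode``, ``ToyNotifier`` and the ``toy_`` functions
are hypothetical, and the pair of callbacks (one invoked before the node leaves
its old context, one after it joins the new one) is an assumption made for
illustration.  Consult the block layer headers for the real signatures.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Toy model, NOT QEMU code: all Toy and toy_ names are hypothetical. */
   typedef struct {
       const char *name;
   } ToyContext;

   typedef struct {
       void (*attached)(ToyContext *new_ctx, void *opaque);
       void (*detach)(void *opaque);
       void *opaque;
   } ToyNotifier;

   typedef struct {
       ToyContext *ctx;
       ToyNotifier notifier;   /* the toy supports a single notifier */
   } ToyNode;

   /* Move the node to new_ctx, telling the notifier before and after. */
   static void toy_node_change_context(ToyNode *node, ToyContext *new_ctx)
   {
       assert(new_ctx != NULL);
       if (node->notifier.detach) {
           node->notifier.detach(node->notifier.opaque);
       }
       node->ctx = new_ctx;
       if (node->notifier.attached) {
           node->notifier.attached(new_ctx, node->notifier.opaque);
       }
   }

   /* A long-running job that must follow its node between contexts. */
   typedef struct {
       ToyContext *ctx;
       int moves;
   } ToyJob;

   static void toy_job_detach(void *opaque)
   {
       ToyJob *job = opaque;

       job->ctx = NULL;        /* stop using the old context */
   }

   static void toy_job_attached(ToyContext *new_ctx, void *opaque)
   {
       ToyJob *job = opaque;

       job->ctx = new_ctx;     /* resume work in the new context */
       job->moves++;
   }

A job that registers ``toy_job_detach()`` and ``toy_job_attached()`` on a node
ends up tracking the node's context without polling for changes.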