qcow2 L2/refcount cache configuration
=====================================
Copyright (C) 2015, 2018 Igalia, S.L.
Author: Alberto Garcia <berto@igalia.com>

This work is licensed under the terms of the GNU GPL, version 2 or
later. See the COPYING file in the top-level directory.

Introduction
------------
The QEMU qcow2 driver has two caches that can improve the I/O
performance significantly. However, setting the right cache sizes is
not a straightforward operation.

This document attempts to give an overview of the L2 and refcount
caches, and how to configure them.

Please refer to the docs/interop/qcow2.txt file for an in-depth
technical description of the qcow2 file format.


Clusters
--------
A qcow2 file is organized in units of constant size called clusters.

The cluster size is configurable, but it must be a power of two and
at least 512 bytes. QEMU currently defaults to 64 KB clusters, and it
does not support sizes larger than 2 MB.

The 'qemu-img create' command supports specifying the size using the
cluster_size option:

   qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G

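As an illustration, the constraints above can be written as a small
Python check. This is only a sketch: the function and constant names
are invented for this example and are not part of QEMU.

```python
MIN_CLUSTER_SIZE = 512               # bytes, lower bound stated above
MAX_CLUSTER_SIZE = 2 * 1024 * 1024   # bytes, 2 MB upper bound stated above


def is_valid_cluster_size(size: int) -> bool:
    """Return True if size is a power of two between 512 bytes and 2 MB."""
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and MIN_CLUSTER_SIZE <= size <= MAX_CLUSTER_SIZE
```

For example, 128K (131072 bytes) passes the check, while 1000 bytes
(not a power of two) and 4 MB (too large) do not.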

The L2 tables
-------------
The qcow2 format uses a two-level structure to map the virtual disk as
seen by the guest to the disk image in the host. These structures are
called the L1 and L2 tables.

There is one single L1 table per disk image. The table is small and is
always kept in memory.

There can be many L2 tables, depending on how much space has been
allocated in the image. Each table is one cluster in size. In order to
read or write data from the virtual disk, QEMU needs to read its
corresponding L2 table to find out where that data is located. Since
reading the table for each I/O operation can be expensive, QEMU keeps
an L2 cache in memory to speed up disk access.

The size of the L2 cache can be configured, and setting the right
value can improve the I/O performance significantly.

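To make the two-level lookup more concrete, the following Python
sketch computes which L1 entry and which L2 entry describe a given
guest offset. It assumes 8-byte L2 entries, which is consistent with
the formulas in "Choosing the right cache sizes". The function name is
hypothetical and this is not QEMU's actual code.

```python
L2_ENTRY_SIZE = 8  # bytes per L2 entry (assumption, see the text above)


def l2_location(guest_offset: int, cluster_size: int = 65536):
    """Return (l1_index, l2_index) for a byte offset in the virtual disk."""
    entries_per_table = cluster_size // L2_ENTRY_SIZE  # one table = one cluster
    cluster_index = guest_offset // cluster_size       # which guest cluster
    l1_index = cluster_index // entries_per_table      # selects the L2 table
    l2_index = cluster_index % entries_per_table       # entry inside that table
    return l1_index, l2_index
```

With 64 KB clusters each L2 table holds 8192 entries and therefore
maps 512 MB of virtual disk, so an access at offset 512 MB is the
first one that needs the second L2 table.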

The refcount blocks
-------------------
The qcow2 format also maintains a reference count for each cluster.
Reference counts are used for cluster allocation and internal
snapshots. The data is stored in a two-level structure similar to the
L1/L2 tables described above.

The second-level structures are called refcount blocks. Like the L2
tables, they are one cluster in size, and their number varies with
the amount of allocated space.

Each block contains a number of refcount entries. Their size (in bits)
is a power of two and must not be higher than 64. It defaults to 16
bits, but a different value can be set using the refcount_bits option:

   qemu-img create -f qcow2 -o refcount_bits=8 hd.qcow2 4G

QEMU keeps a refcount cache to speed up I/O much like the
aforementioned L2 cache, and its size can also be configured.

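Since a refcount block is one cluster in size, the number of entries
it holds follows directly from refcount_bits. The Python sketch below
illustrates this. The function name is made up for the example, and
the validation only mirrors the rules stated in this section.

```python
def refcount_entries_per_block(cluster_size: int = 65536,
                               refcount_bits: int = 16) -> int:
    """Number of refcount entries that fit in one refcount block."""
    is_power_of_two = (refcount_bits > 0
                       and (refcount_bits & (refcount_bits - 1)) == 0)
    if not is_power_of_two or refcount_bits > 64:
        raise ValueError("refcount_bits must be a power of two, at most 64")
    # A block is one cluster: convert its size to bits, divide by entry width.
    return cluster_size * 8 // refcount_bits
```

With the defaults this gives 32768 entries per block. Each entry
tracks one cluster, so a single block covers 2 GB of image data.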

Choosing the right cache sizes
------------------------------
In order to choose the cache sizes we need to know how they relate to
the amount of allocated space.

The amount of virtual disk that can be mapped by the L2 and refcount
caches (in bytes) is:

   disk_size = l2_cache_size * cluster_size / 8
   disk_size = refcount_cache_size * cluster_size * 8 / refcount_bits

With the default values for cluster_size (64KB) and refcount_bits
(16), that is

   disk_size = l2_cache_size * 8192
   disk_size = refcount_cache_size * 32768

So in order to cover disk_size_GB gigabytes of disk space with the
default values we need (sizes in bytes):

   l2_cache_size = disk_size_GB * 131072
   refcount_cache_size = disk_size_GB * 32768

QEMU has a default L2 cache of 1MB (1048576 bytes) and a refcount
cache of 256KB (262144 bytes), so using the formulas we've just seen
we have

   1048576 / 131072 = 8 GB of virtual disk covered by that cache
    262144 /  32768 = 8 GB

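The formulas above are easy to evaluate for other cluster sizes or
refcount widths. The Python sketch below is one way to do it. All
function names are invented for this example, and the results are not
rounded to a multiple of the cluster size.

```python
GB = 1024 ** 3


def ceil_div(a: int, b: int) -> int:
    """Integer division rounding up."""
    return -(-a // b)


def l2_cache_coverage(l2_cache_size, cluster_size=65536):
    """Bytes of virtual disk mapped by an L2 cache of the given size."""
    return l2_cache_size * cluster_size // 8


def refcount_cache_coverage(refcount_cache_size, cluster_size=65536,
                            refcount_bits=16):
    """Bytes of disk covered by a refcount cache of the given size."""
    return refcount_cache_size * cluster_size * 8 // refcount_bits


def l2_cache_size_for(disk_size, cluster_size=65536):
    """L2 cache size (bytes) needed to map disk_size bytes."""
    return ceil_div(disk_size * 8, cluster_size)


def refcount_cache_size_for(disk_size, cluster_size=65536, refcount_bits=16):
    """Refcount cache size (bytes) needed to cover disk_size bytes."""
    return ceil_div(disk_size * refcount_bits, 8 * cluster_size)
```

For instance, l2_cache_size_for(100 * GB) returns 13107200 bytes
(12.5 MB) with the default cluster size.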

How to configure the cache sizes
--------------------------------
Cache sizes can be configured using the -drive option in the
command-line, or the 'blockdev-add' QMP command.

There are three options available, and all of them take a size in
bytes:

"l2-cache-size":         maximum size of the L2 table cache
"refcount-cache-size":   maximum size of the refcount block cache
"cache-size":            maximum size of both caches combined

There are two things that need to be taken into account:

 - Both caches must have a size that is a multiple of the cluster size
   (or the cache entry size: see "Using smaller cache entries" below).

 - If you only set one of the options above, QEMU will automatically
   adjust the others so that the L2 cache is 4 times bigger than the
   refcount cache.

This means that these options are equivalent:

   -drive file=hd.qcow2,l2-cache-size=2097152
   -drive file=hd.qcow2,refcount-cache-size=524288
   -drive file=hd.qcow2,cache-size=2621440

The reason for this 1/4 ratio is to ensure that both caches cover the
same amount of disk space. Note however that this is only valid with
the default value of refcount_bits (16). If you are using a different
value you might want to calculate both cache sizes yourself since QEMU
will always use the same 1/4 ratio.

It's also worth mentioning that there's no strict need for both caches
to cover the same amount of disk space. The refcount cache is used
much less often than the L2 cache, so it's perfectly reasonable to
keep it small.

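The 4:1 rule, and the equivalence of the three -drive examples in
this section, can be checked with a short Python sketch. It models
only the rule as described here: it ignores rounding to the cluster
size and any limits QEMU may apply, and the function name is invented
for the example.

```python
L2_TO_REFCOUNT_RATIO = 4  # L2 cache is 4 times the refcount cache


def derive_cache_sizes(l2_cache_size=None, refcount_cache_size=None,
                       cache_size=None):
    """Given exactly one option, return (l2_cache_size, refcount_cache_size)."""
    given = [v for v in (l2_cache_size, refcount_cache_size, cache_size)
             if v is not None]
    if len(given) != 1:
        raise ValueError("this sketch expects exactly one option")
    if l2_cache_size is not None:
        refcount_cache_size = l2_cache_size // L2_TO_REFCOUNT_RATIO
    elif refcount_cache_size is not None:
        l2_cache_size = refcount_cache_size * L2_TO_REFCOUNT_RATIO
    else:
        # Split the combined size into 4 parts L2 and 1 part refcount.
        refcount_cache_size = cache_size // (L2_TO_REFCOUNT_RATIO + 1)
        l2_cache_size = cache_size - refcount_cache_size
    return l2_cache_size, refcount_cache_size
```

Each of the three example settings yields an L2 cache of 2097152
bytes and a refcount cache of 524288 bytes.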

Using smaller cache entries
---------------------------
The qcow2 L2 cache stores complete tables by default. This means that
if QEMU needs an entry from an L2 table then the whole table is read
from disk and is kept in the cache. If the cache is full then a
complete table needs to be evicted first.

This can be inefficient with large cluster sizes since it results in
more disk I/O and wastes more cache memory.

Since QEMU 2.12 you can change the size of the L2 cache entry and make
it smaller than the cluster size. This can be configured using the
"l2-cache-entry-size" parameter:

   -drive file=hd.qcow2,l2-cache-size=2097152,l2-cache-entry-size=4096

Some things to take into account:

 - The L2 cache entry size has the same restrictions as the cluster
   size (power of two, at least 512 bytes).

 - Smaller entry sizes generally improve the cache efficiency and make
   disk I/O faster. This is particularly true with solid state drives
   so it's a good idea to reduce the entry size in those cases. With
   rotating hard drives the situation is a bit more complicated so you
   should test it first and stay with the default size if unsure.

 - Try different entry sizes to see which one gives faster performance
   in your case. The block size of the host filesystem is generally a
   good default (usually 4096 bytes in the case of ext4).

 - Only the L2 cache can be configured this way. The refcount cache
   always uses the cluster size as the entry size.

 - If the L2 cache is big enough to hold all of the image's L2 tables
   (as explained in the "Choosing the right cache sizes" section
   earlier in this document) then none of this is necessary and you
   can omit the "l2-cache-entry-size" parameter altogether.

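The effect of the entry size on the cache layout can be illustrated
with a small Python sketch. The function names are invented for this
example. The upper bound (entry size no larger than the cluster size)
is an assumption based on the default being one full cluster.

```python
def is_valid_l2_entry_size(entry_size: int, cluster_size: int = 65536) -> bool:
    """Power of two, at least 512 bytes, not larger than a cluster."""
    is_power_of_two = entry_size > 0 and (entry_size & (entry_size - 1)) == 0
    return is_power_of_two and 512 <= entry_size <= cluster_size


def l2_cache_entry_count(l2_cache_size: int, entry_size: int) -> int:
    """How many independently evictable entries the L2 cache holds."""
    return l2_cache_size // entry_size


def disk_mapped_per_entry(entry_size: int, cluster_size: int = 65536) -> int:
    """Bytes of virtual disk described by one cache entry (8-byte L2 entries)."""
    return entry_size // 8 * cluster_size
```

With the 2097152-byte cache from the example above, an entry size of
4096 gives 512 cache entries that each map 32 MB of virtual disk,
instead of 32 entries that each map 512 MB with the default 64 KB
entry size. The total amount of disk covered by the cache stays the
same.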

Reducing the memory usage
-------------------------
It is possible to clean unused cache entries in order to reduce the
memory usage during periods of low I/O activity.

The parameter "cache-clean-interval" defines an interval (in seconds).
All cache entries that haven't been accessed during that interval are
removed from memory.

This example removes all unused cache entries every 15 minutes:

   -drive file=hd.qcow2,cache-clean-interval=900

If unset, the default value for this parameter is 0 and it disables
this feature.

Note that this functionality currently relies on the MADV_DONTNEED
argument for madvise() to actually free the memory. This is a
Linux-specific feature, so cache-clean-interval is not supported in
other systems.

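The cleaning policy described in this section can be pictured with a
toy Python model. This is not how QEMU implements it: the class and
method names are invented, time is passed in explicitly, and no memory
is actually released. It only shows which entries would be dropped.

```python
class CleanableCache:
    """Toy model: tracks the last access time of each cache entry."""

    def __init__(self, clean_interval: int):
        self.clean_interval = clean_interval  # seconds, 0 disables cleaning
        self.last_access = {}                 # entry key -> time of last access

    def access(self, key, now: int) -> None:
        """Record that an entry was used at time 'now' (in seconds)."""
        self.last_access[key] = now

    def clean(self, now: int) -> int:
        """Drop entries not accessed during the last interval; return count."""
        if self.clean_interval == 0:
            return 0
        stale = [key for key, t in self.last_access.items()
                 if now - t >= self.clean_interval]
        for key in stale:
            del self.last_access[key]
        return len(stale)
```

With an interval of 900 seconds, an entry last used at time 0 is
dropped by a cleaning pass at time 900, while an entry used at time
600 is kept.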