(RDMA: Remote Direct Memory Access)
RDMA Live Migration Specification, Version # 1
==============================================
Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git, 'rdma' branch

Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>

An *exhaustive* paper (2010) with additional performance details is
linked on the QEMU wiki above.

Contents:
=========
* Introduction
* Before running
* Running
* Performance
* RDMA Migration Protocol Description
* Versioning and Capabilities
* QEMUFileRDMA Interface
* Migration of pc.ram
* Error handling
* TODO

Introduction:
=============

RDMA helps make your migration more deterministic under heavy load because
of its significantly lower latency and higher throughput than TCP/IP. This is
because the RDMA I/O architecture reduces the number of interrupts and
data copies by bypassing the host networking stack. In particular, a TCP-based
migration, under certain types of memory-bound workloads, may take a more
unpredictable amount of time to complete the migration if the amount of
memory tracked during each live migration iteration round cannot keep pace
with the rate of dirty memory produced by the workload.

RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA
over Converged Ethernet) and Infiniband-based. This implementation of
migration using RDMA is capable of using both technologies because of
the use of the OpenFabrics OFED software stack that abstracts out the
programming model irrespective of the underlying hardware.

Refer to openfabrics.org or your respective RDMA hardware vendor for
an understanding of how to verify that you have the OFED software stack
installed in your environment.
You should be able to successfully link
against the "librdmacm" and "libibverbs" libraries and development headers
for a working build of QEMU to run successfully using RDMA Migration.

BEFORE RUNNING:
===============

Use of RDMA during migration requires pinning and registering memory
with the hardware. This means that memory must be physically resident
before the hardware can transmit that memory to another machine.
If this is not acceptable for your application or product, then the use
of RDMA migration may in fact be harmful to co-located VMs or other
software on the machine if there is not sufficient memory available to
relocate the entire footprint of the virtual machine. If so, then the
use of RDMA is discouraged and it is recommended to use standard TCP migration.

Experimental: Next, decide whether you want to keep dynamic page
registration (the default) or pin all memory up front with the
x-rdma-pin-all capability. For example, if you have an 8GB RAM virtual
machine, but only 1GB is in active use, then enabling x-rdma-pin-all will
cause all 8GB to be pinned and resident in memory. This capability mostly
affects the bulk-phase round of the migration and can be enabled for
extremely high-performance RDMA hardware using the following command:

QEMU Monitor Command:
$ migrate_set_capability x-rdma-pin-all on # disabled by default

Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.

On the other hand, this will also significantly speed up the bulk round
of the migration, which can greatly reduce the "total" time of your migration.
Example performance of this using an idle VM in the previous example
can be found in the "Performance" section.

Note: for very large virtual machines (hundreds of GBs), pinning
*all* of the memory of your virtual machine in the kernel is very expensive
and may extend the initial bulk iteration time by many seconds,
thus extending the total migration time.
However, this will not
affect the determinism or predictability of your migration; you will
still gain the benefits of advanced pinning with RDMA.

RUNNING:
========

First, set the migration speed to match your hardware's capabilities:

QEMU Monitor Command:
$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device

Next, on the destination machine, add the following to the QEMU command line:

qemu ..... -incoming x-rdma:host:port

Finally, perform the actual migration on the source machine:

QEMU Monitor Command:
$ migrate -d x-rdma:host:port

PERFORMANCE
===========

Here is a brief summary of total migration time and downtime using RDMA,
measured over a 40 gbps infiniband link while performing a worst-case
stress test on an 8GB RAM virtual machine.

Using the following command:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep

1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.

EFFECTS of memory registration on bulk phase round:

For example, in the same 8GB RAM example with all 8GB of memory in
active use and the VM itself completely idle, using the same 40 gbps
infiniband link:

1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps

These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.

Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, without this feature, all of the
memory will have already been registered in advance during
the bulk round and does not need to be re-registered during the successive
iteration rounds.

RDMA Protocol Description:
==========================

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

"Everything else" is transmitted using a formal
protocol now, consisting of infiniband SEND messages.

An infiniband SEND message is the standard ibverbs
message used by applications of infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause notifications
to be posted to the completion queue (CQ) on the
infiniband receiver side, whereas RDMA messages (used
for pc.ram) do not (to behave like an actual DMA).

Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.

RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.

(Memory is not released from pinning until the migration
completes, given that RDMA migrations are very fast.)

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a control transport for migration of device state.

To begin the migration, the initial connection setup is
as follows (migration-rdma.c):

1. Receiver and Sender are started (command line or libvirt):
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver accept()
6. Check versioning and capabilities (described later)

At this point, we define a control channel on top of SEND messages
which is described by a formal protocol.
Each SEND message has a
header portion and a data portion (but together they are transmitted
as a single SEND message).

Header:
 * Length (of the data portion, uint32, network byte order)
 * Type (what command to perform, uint32, network byte order)
 * Repeat (Number of commands in data portion, same type only)

The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself
so that the protocol is compatible across multiple versions of QEMU.
Version #1 requires that all server implementations of the protocol
check this field and register all requests found in the array of commands located
in the data portion and return an equal number of results in the response.
The maximum number of repeats is hard-coded to 4096. This is a conservative
limit based on the maximum size of a SEND message along with empirical
observations on the maximum future benefit of simultaneous page registrations.

The 'type' field has 10 different command values:
 1. Unused
 2. Error (sent to the source during bad things)
 3. Ready (control-channel is available)
 4. QEMU File (for sending non-live device state)
 5. RAM Blocks request (used right after connection setup)
 6. RAM Blocks result (used right after connection setup)
 7. Compress page (zap zero page and skip registration)
 8. Register request (dynamic chunk registration)
 9. Register result ('rkey' to be used by sender)
 10. Register finished (registration for current iteration finished)

A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
one command, then the 'repeat' field will be greater than 1.

After connection setup, messages 5 & 6 are used to exchange ram block
information and optionally pin all the memory if requested by the user.
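
The control-message layout above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the QEMU
implementation: the 'Repeat' field is assumed to be a uint32 in network
byte order like the other two fields, the numeric command values are
assumed to follow the numbering of the list above (the values used on the
wire may differ), and the names pack_message() and unpack_message() are
hypothetical.

```python
import struct
from enum import IntEnum

# Assumption: all three header fields are unsigned 32-bit integers in
# network byte order. The text only states this for Length and Type.
HEADER = struct.Struct("!III")
MAX_REPEAT = 4096  # hard-coded limit quoted in the text


class Command(IntEnum):
    # Assumption: values mirror the numbered list in the text.
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10


def pack_message(command, data=b"", repeat=1):
    """Build one control message: header followed by the data portion."""
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError("repeat out of range")
    return HEADER.pack(len(data), int(command), repeat) + data


def unpack_message(message):
    """Split a control message into (command, repeat, data)."""
    length, command, repeat = HEADER.unpack_from(message)
    data = message[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated data portion")
    return Command(command), repeat, data
```

Because the header carries the length of the data portion, a receiver can
reject a truncated message before interpreting any of the commands in it.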

After ram block exchange is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values:

Logically:

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match the ones we expected.

qemu_rdma_exchange_send(header, data, optional response header & data):

1. Block on the CQ event channel waiting for a READY command
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
   (that we have not yet transmitted), let's post an RQ
   work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
   unblock us and we immediately post an RQ work request
   to replace the one we just used up.
4. Now, we can actually post the work request to SEND
   the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
   we block again and wait for that response using the additional
   work request we previously posted. (This is used to carry
   'Register result' commands #9 back to the sender which
   hold the rkey needed to perform RDMA. Note that the virtual address
   corresponding to this rkey was already exchanged at the beginning
   of the connection (described below).)
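
The READY handshake in the two step lists above can be modelled with a toy,
single-process sketch. It deliberately leaves out everything specific to
infiniband (RQ work requests, CQ event channels, blocking) and only shows
the ordering rule: the sender may transmit a command only after it has
consumed a READY from the receiver. The ControlChannel class and the helper
names are hypothetical and exist only for this illustration.

```python
from collections import deque

READY = "ready"


class ControlChannel:
    """In-memory stand-in for the two directions of the control channel."""

    def __init__(self):
        self.to_sender = deque()    # messages travelling receiver -> sender
        self.to_receiver = deque()  # messages travelling sender -> receiver


def recv_announce_ready(chan):
    """Receiver side, step 1: tell the sender that a command may be sent."""
    chan.to_sender.append((READY, b""))


def exchange_send(chan, command, data):
    """Sender side: consume one READY, then transmit exactly one command."""
    if not chan.to_sender:
        raise RuntimeError("receiver has not signalled READY")
    kind, _ = chan.to_sender.popleft()
    if kind != READY:
        raise RuntimeError("expected READY, got %r" % (kind,))
    chan.to_receiver.append((command, data))


def recv_expect(chan, expected):
    """Receiver side, last steps: take the command and verify its type."""
    command, data = chan.to_receiver.popleft()
    if command != expected:
        raise RuntimeError("expected %r, got %r" % (expected, command))
    return data
```

In this model a second exchange_send() without a fresh READY fails, which
mirrors why the receiver must announce readiness before every command.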

All of the remaining command types (not including 'ready')
described above use the aforementioned two functions to do the hard work:

1. After connection setup, RAMBlock information is exchanged using
   this protocol before the actual migration begins. This information includes
   a description of each RAMBlock on the server side as well as the virtual addresses
   and lengths of each RAMBlock. This is used by the client to determine the
   start and stop locations of chunks and how to register them dynamically
   before performing the RDMA operations.
2. During runtime, once a 'chunk' becomes full of pages ready to
   be sent with RDMA, the registration commands are used to ask the
   other side to register the memory for this chunk and respond
   with the result (rkey) of the registration.
3. The QEMUFile interfaces also call these functions (described below)
   when transmitting non-live state, such as devices, or to send
   their own protocol information during the migration process.
4. Finally, zero pages are only checked if a page has not yet been registered
   using chunk registration (or not checked at all and unconditionally
   written if chunk registration is disabled). This is accomplished using
   the "Compress" command listed above. If the page *has* been registered
   then we check the entire chunk for zero. Only if the entire chunk is
   zero do we send a compress command to zap the page on the other side.

Versioning and Capabilities
===========================
Current version of the protocol is version #1.

The same version applies to both protocol traffic and capabilities
negotiation (i.e. there is only one version number that is referred to
by all communication).

librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.

Header:
 * Version (protocol version validated before send/recv occurs), uint32, network byte order
 * Flags (bitwise OR of each capability), uint32, network byte order

There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
is only 192 bytes per the Infiniband specification, so it's not
very useful for data anyway. This structure needs to remain small.

This private data area is a convenient place to check for protocol
versioning because the user does not need to register memory to
transmit a few bytes of version information.

This is also a convenient place to negotiate capabilities
(like dynamic page registration).

If the version is invalid, we throw an error.

If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.

Currently there is only *one* capability in Version #1: dynamic page registration

Finally: Negotiation happens with the Flags field: If the primary-VM
sets a flag, but the destination does not support this capability, it
will return a zero-bit for that flag and the primary-VM will understand
that as not being an available capability and will thus disable that
capability on the primary-VM side.

QEMUFileRDMA Interface:
=======================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol
described above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we hand off the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from the control-channel's SEND
messages in memory.

Each time we receive a complete "QEMU File" control-channel
message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.

Migration of pc.ram:
====================

At the beginning of the migration (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration of main memory. This description includes
a list of all the RAMBlocks, their offsets and lengths, virtual
addresses and possibly includes pre-registered RDMA keys in case dynamic
page registration was disabled on the server-side, otherwise not.

Main memory is not migrated with the aforementioned protocol,
but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.

When a chunk is full (or a flush() occurs), the memory backed by
the chunk is registered with librdmacm and pinned in memory on
both sides using the aforementioned protocol.
After pinning, an RDMA Write is generated and transmitted
for the entire chunk.
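
The chunk arithmetic implied above can be sketched as follows. This is an
illustration, not the code in migration-rdma.c: it assumes that chunks are
laid out back to back starting at the first byte of a RAMBlock and that the
last chunk of a block may be shorter than 1 Megabyte. The function names
are hypothetical.

```python
CHUNK_SIZE = 1 << 20  # 1 Megabyte, the hard-coded chunk size from the text


def chunk_count(block_length):
    """Number of chunks needed to cover a RAMBlock of the given length."""
    return (block_length + CHUNK_SIZE - 1) // CHUNK_SIZE


def chunk_index(block_start, address):
    """Index of the chunk that contains 'address' within its RAMBlock."""
    if address < block_start:
        raise ValueError("address lies before the start of the block")
    return (address - block_start) // CHUNK_SIZE


def chunk_bounds(block_start, block_length, index):
    """Return the (start, end) addresses of a chunk; 'end' is exclusive."""
    if not 0 <= index < chunk_count(block_length):
        raise IndexError("chunk index out of range")
    start = block_start + index * CHUNK_SIZE
    # Assumption: the final chunk is clipped to the end of the block.
    end = min(start + CHUNK_SIZE, block_start + block_length)
    return start, end
```

With the RAMBlock virtual addresses and lengths exchanged at connection
setup, both sides can derive the same chunk boundaries without sending them
in every registration request.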

Chunks are also transmitted in batches: This means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.

Error-handling:
===============

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode
we use for RDMA migration.

If a *single* message fails,
the decision is to abort the migration entirely,
clean up all the RDMA descriptors and unregister all
the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way it would be if the TCP
socket were broken during a non-RDMA based migration.

TODO:
=====
1. 'migrate x-rdma:host:port' and '-incoming x-rdma' options will be
   renamed to 'rdma' after the experimental phase of this work has
   completed upstream.
2. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
   are not compatible with infiniband memory pinning and will result in
   an aborted migration (but with the source VM left unaffected).
3. Use of the recent /proc/<pid>/pagemap would likely speed up
   the use of KSM and ballooning while using RDMA.
4. Some form of balloon-device usage tracking would also
   help alleviate some issues.
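
Regarding TODO item 2, a rough pre-flight check of the 'ulimit -l' limit
can be sketched with Python's standard resource module (Unix only). This is
not part of QEMU. It assumes the check runs under the same limits as the
QEMU process and that the amount of memory to pin is roughly the guest RAM
size. The function name memlock_allows() is hypothetical.

```python
import resource


def memlock_allows(bytes_to_pin, soft_limit=None):
    """Return True if the locked-memory soft limit permits pinning.

    When 'soft_limit' is None, the limit of the current process
    (what 'ulimit -l' reports, but in bytes) is used.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft_limit == resource.RLIM_INFINITY:
        return True  # unlimited: pinning is not restricted by this limit
    return bytes_to_pin <= soft_limit
```

For example, memlock_allows(8 << 30) reports whether an 8GB guest could be
fully pinned under the current limit. A False result suggests raising the
limit or using TCP migration instead.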