.. SPDX-License-Identifier: GPL-2.0

VMbus
=====
VMbus is a software construct provided by Hyper-V to guest VMs. It
consists of a control path and common facilities used by synthetic
devices that Hyper-V presents to guest VMs. The control path is
used to offer synthetic devices to the guest VM and, in some cases,
to rescind those devices. The common facilities include software
channels for communicating between the device driver in the guest VM
and the synthetic device implementation that is part of Hyper-V, and
signaling primitives to allow Hyper-V and the guest to interrupt
each other.

VMbus is modeled in Linux as a bus, with the expected /sys/bus/vmbus
entry in a running Linux guest. The VMbus driver (drivers/hv/vmbus_drv.c)
establishes the VMbus control path with the Hyper-V host, then
registers itself as a Linux bus driver. It implements the standard
bus functions for adding devices to and removing devices from the bus.

Most synthetic devices offered by Hyper-V have a corresponding Linux
device driver. These devices include:

* SCSI controller
* NIC
* Graphics frame buffer
* Keyboard
* Mouse
* PCI device pass-thru
* Heartbeat
* Time Sync
* Shutdown
* Memory balloon
* Key/Value Pair (KVP) exchange with Hyper-V
* Hyper-V online backup (a.k.a. VSS)

Guest VMs may have multiple instances of the synthetic SCSI
controller, synthetic NIC, and PCI pass-thru devices. Other
synthetic devices are limited to a single instance per VM. Not
listed above are a small number of synthetic devices offered by
Hyper-V that are used only by Windows guests and for which Linux
does not have a driver.

Hyper-V uses the terms "VSP" and "VSC" in describing synthetic
devices. "VSP" refers to the Hyper-V code that implements a
particular synthetic device, while "VSC" refers to the driver for
the device in the guest VM. For example, the Linux driver for the
synthetic NIC is referred to as "netvsc" and the Linux driver for
the synthetic SCSI controller is "storvsc". These drivers contain
functions with names like "storvsc_connect_to_vsp".

VMbus channels
--------------
An instance of a synthetic device uses VMbus channels to communicate
between the VSP and the VSC. Channels are bi-directional and used
for passing messages. Most synthetic devices use a single channel,
but the synthetic SCSI controller and synthetic NIC may use multiple
channels to achieve higher performance and greater parallelism.

Each channel consists of two ring buffers. These are classic ring
buffers from a university data structures textbook. If the read
and write pointers are equal, the ring buffer is considered to be
empty, so a full ring buffer always has at least one byte unused.
The "in" ring buffer is for messages from the Hyper-V host to the
guest, and the "out" ring buffer is for messages from the guest to
the Hyper-V host. In Linux, the "in" and "out" designations are as
viewed by the guest side. The ring buffers are memory that is
shared between the guest and the host, and they follow the standard
paradigm where the memory is allocated by the guest, with the list
of GPAs that make up the ring buffer communicated to the host. Each
ring buffer consists of a header page (4 Kbytes) with the read and
write indices and some control flags, followed by the memory for the
actual ring. The size of the ring is determined by the VSC in the
guest and is specific to each synthetic device. The list of GPAs
making up the ring is communicated to the Hyper-V host over the
VMbus control path as a GPA Descriptor List (GPADL). See function
vmbus_establish_gpadl().
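The empty/full convention above is simple but easy to get backwards,
so here is a minimal sketch in C. The struct is a simplified
stand-in for the real header layout (struct hv_ring_buffer in
include/linux/hyperv.h); only the two indices are shown::

  #include <linux/types.h>

  /* Simplified ring buffer header: read/write indices only */
  struct ring_hdr {
          u32 write_index;        /* next byte to be written */
          u32 read_index;         /* next byte to be read */
  };

  /* Empty when the indices are equal... */
  static bool ring_empty(const struct ring_hdr *hdr)
  {
          return hdr->read_index == hdr->write_index;
  }

  /*
   * ...so a writer must leave at least one byte unused. If
   * write_index were allowed to catch up to read_index, a
   * completely full ring would be indistinguishable from an
   * empty one.
   */
  static u32 ring_writable_bytes(const struct ring_hdr *hdr, u32 size)
  {
          u32 used = (hdr->write_index + size - hdr->read_index) % size;

          return size - used - 1;
  }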
Each ring buffer is mapped into contiguous Linux kernel virtual
space in three parts: 1) the 4 Kbyte header page, 2) the memory
that makes up the ring itself, and 3) a second mapping of the memory
that makes up the ring itself. Because (2) and (3) are contiguous
in kernel virtual space, the code that copies data to and from the
ring buffer need not be concerned with ring buffer wrap-around.
Once a copy operation has completed, the read or write index may
need to be reset to point back into the first mapping, but the
actual data copy does not need to be broken into two parts. This
approach also allows complex data structures to be easily accessed
directly in the ring without handling wrap-around.

On arm64 with page sizes > 4 Kbytes, the header page must still be
passed to Hyper-V as a 4 Kbyte area. But the memory for the actual
ring must be aligned to PAGE_SIZE and have a size that is a multiple
of PAGE_SIZE so that the duplicate mapping trick can be done. Hence
a portion of the header page is unused and not communicated to
Hyper-V. This case is handled by vmbus_establish_gpadl().

Hyper-V enforces a limit on the aggregate amount of guest memory
that can be shared with the host via GPADLs. This limit ensures
that a rogue guest can't force the consumption of excessive host
resources. For Windows Server 2019 and later, this limit is
approximately 1280 Mbytes. For versions prior to Windows Server
2019, the limit is approximately 384 Mbytes.

VMbus messages
--------------
All VMbus messages have a standard header that includes the message
length, the offset of the message payload, some flags, and a
transactionID. The portion of the message after the header is
unique to each VSP/VSC pair.

Messages follow one of two patterns:

* Unidirectional: Either side sends a message and does not
  expect a response message
* Request/response: One side (usually the guest) sends a message
  and expects a response

The transactionID (a.k.a. "requestID") is for matching requests &
responses. Some synthetic devices allow multiple requests to be
in-flight simultaneously, so the guest specifies a transactionID
when sending a request. Hyper-V sends back the same transactionID
in the matching response.
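Concretely, the standard header that carries the length, offset,
flags, and transactionID looks roughly like the sketch below. The
authoritative definition is struct vmpacket_descriptor in
include/linux/hyperv.h; note that the length and offset fields are
carried in units of 8 bytes::

  struct vmpacket_descriptor {
          u16 type;       /* e.g. VM_PKT_DATA_INBAND, VM_PKT_COMP */
          u16 offset8;    /* payload offset from start of header,
                           * in 8-byte units */
          u16 len8;       /* total packet length, in 8-byte units */
          u16 flags;      /* e.g. completion (response) requested */
          u64 trans_id;   /* transactionID echoed in the response */
  } __packed;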
Messages passed between the VSP and VSC are control messages. For
example, a message sent from the storvsc driver might be "execute
this SCSI command". If a message also implies some data transfer
between the guest and the Hyper-V host, the actual data to be
transferred may be embedded with the control message, or it may be
specified as a separate data buffer that the Hyper-V host will
access as a DMA operation. The former case is used when the size of
the data is small and the cost of copying the data to and from the
ring buffer is minimal. For example, time sync messages from the
Hyper-V host to the guest contain the actual time value. When the
data is larger, a separate data buffer is used. In this case, the
control message contains a list of GPAs that describe the data
buffer. For example, the storvsc driver uses this approach to
specify the data buffers to/from which disk I/O is done.

Three functions exist to send VMbus messages:

1. vmbus_sendpacket(): Control-only messages and messages with
   embedded data -- no GPAs
2. vmbus_sendpacket_pagebuffer(): Message with list of GPAs
   identifying data to transfer. An offset and length is
   associated with each GPA so that multiple discontinuous areas
   of guest memory can be targeted.
3. vmbus_sendpacket_mpb_desc(): Message with list of GPAs
   identifying data to transfer. A single offset and length is
   associated with a list of GPAs. The GPAs must describe a
   single logical area of guest memory to be targeted.
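As an illustration of the first variant, a control-only request with
embedded data might be sent as sketched below. The message struct
and opcode are hypothetical; the vmbus_sendpacket() signature and
the packet type and flag constants are from include/linux/hyperv.h::

  #include <linux/hyperv.h>

  struct example_req {    /* hypothetical device-specific message */
          u32 opcode;
          u32 arg;
  };

  static int example_send(struct vmbus_channel *chan, u64 req_id)
  {
          struct example_req req = {
                  .opcode = 1,    /* hypothetical "get status" */
          };

          /*
           * VM_PKT_DATA_INBAND: the payload travels in the "out"
           * ring buffer itself. The completion flag asks the host
           * for a response carrying the same requestID (req_id).
           */
          return vmbus_sendpacket(chan, &req, sizeof(req), req_id,
                                  VM_PKT_DATA_INBAND,
                                  VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
  }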
Historically, Linux guests have trusted Hyper-V to send well-formed
and valid messages, and Linux drivers for synthetic devices did not
fully validate messages. With the introduction of processor
technologies that fully encrypt guest memory and that allow the
guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting
the Hyper-V host is no longer a valid assumption. The drivers for
VMbus synthetic devices are being updated to fully validate any
values read from memory that is shared with Hyper-V, which includes
messages from VMbus devices. To facilitate such validation,
messages read by the guest from the "in" ring buffer are copied to a
temporary buffer that is not shared with Hyper-V. Validation is
performed in this temporary buffer without the risk of Hyper-V
maliciously modifying the message after it is validated but before
it is used.

VMbus interrupts
----------------
VMbus provides a mechanism for the guest to interrupt the host when
the guest has queued new messages in a ring buffer. The host
expects that the guest will send an interrupt only when an "out"
ring buffer transitions from empty to non-empty. If the guest sends
interrupts at other times, the host deems such interrupts to be
unnecessary. If a guest sends an excessive number of unnecessary
interrupts, the host may throttle that guest by suspending its
execution for a few seconds to prevent a denial-of-service attack.

Similarly, the host will interrupt the guest when it sends a new
message on the VMbus control path, or when a VMbus channel "in" ring
buffer transitions from empty to non-empty. Each CPU in the guest
may receive VMbus interrupts, so they are best modeled as per-CPU
interrupts in Linux. This model works well on arm64 where a single
per-CPU IRQ is allocated for VMbus. Since x86/x64 lacks support for
per-CPU IRQs, an x86 interrupt vector is statically allocated (see
HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to
call the VMbus interrupt service routine. These interrupts are
visible in /proc/interrupts on the "HYP" line.

The guest CPU that a VMbus channel will interrupt is selected by the
guest when the channel is created, and the host is informed of that
selection. VMbus devices are broadly grouped into two categories:

1. "Slow" devices that need only one VMbus channel. The devices
   (such as keyboard, mouse, heartbeat, and timesync) generate
   relatively few interrupts. Their VMbus channels are all
   assigned to interrupt the VMBUS_CONNECT_CPU, which is always
   CPU 0.

2. "High speed" devices that may use multiple VMbus channels for
   higher parallelism and performance. These devices include the
   synthetic SCSI controller and synthetic NIC. Their VMbus
   channel interrupts are assigned to CPUs that are spread out
   among the available CPUs in the VM so that interrupts on
   multiple channels can be processed in parallel.

The assignment of VMbus channel interrupts to CPUs is done in the
function init_vp_index(). This assignment is done outside of the
normal Linux interrupt affinity mechanism, so the interrupts are
neither "unmanaged" nor "managed" interrupts.

The CPU that a VMbus channel will interrupt can be seen in
/sys/bus/vmbus/devices/<deviceGUID>/channels/<channelRelID>/cpu.
When running on later versions of Hyper-V, the CPU can be changed
by writing a new value to this sysfs entry. Because the interrupt
assignment is done outside of the normal Linux affinity mechanism,
there are no entries in /proc/irq corresponding to individual
VMbus channel interrupts.

An online CPU in a Linux guest may not be taken offline if it has
VMbus channel interrupts assigned to it. Any such channel
interrupts must first be manually reassigned to another CPU as
described above. When no channel interrupts are assigned to the
CPU, it can be taken offline.

When a guest CPU receives a VMbus interrupt from the host, the
function vmbus_isr() handles the interrupt. It first checks for
channel interrupts by calling vmbus_chan_sched(), which looks at a
bitmap set up by the host to determine which channels have pending
interrupts on this CPU. If multiple channels have pending
interrupts for this CPU, they are processed sequentially. When all
channel interrupts have been processed, vmbus_isr() checks for and
processes any message received on the VMbus control path, as
sketched below.

The VMbus channel interrupt handling code is designed to work
correctly even if an interrupt is received on a CPU other than the
CPU assigned to the channel. Specifically, the code does not use
CPU-based exclusion for correctness. In normal operation, Hyper-V
will interrupt the assigned CPU. But when the CPU assigned to a
channel is being changed via sysfs, the guest doesn't know exactly
when Hyper-V will make the transition. The code must work correctly
even if there is a time lag before Hyper-V starts interrupting the
new CPU. See comments in target_cpu_store().
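For orientation, the dispatch order described above can be sketched
as follows. The structure and helper stubs are hypothetical; the
real logic is in vmbus_isr() and vmbus_chan_sched() in
drivers/hv/vmbus_drv.c::

  #include <linux/bitops.h>
  #include <linux/types.h>

  #define MAX_RELIDS 2048

  /* Hypothetical per-CPU state: bitmap of pending channel relIDs
   * (set by the host), plus a control-path pending flag. */
  struct percpu_vmbus_state {
          unsigned long pending[BITS_TO_LONGS(MAX_RELIDS)];
          bool control_msg_pending;
  };

  static void run_channel_callback(unsigned int relid)
  {
          /* Invoke the channel's registered callback (stub). */
  }

  static void process_control_message(void)
  {
          /* Handle an offer, rescind, etc. message (stub). */
  }

  static void example_vmbus_isr(struct percpu_vmbus_state *st)
  {
          unsigned int relid;

          /* Channels with pending interrupts on this CPU are
           * processed sequentially... */
          for_each_set_bit(relid, st->pending, MAX_RELIDS)
                  run_channel_callback(relid);

          /* ...then any message on the VMbus control path. */
          if (st->control_msg_pending)
                  process_control_message();
  }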
VMbus device creation/deletion
------------------------------
Hyper-V and the Linux guest have a separate message-passing path
that is used for synthetic device creation and deletion. This
path does not use a VMbus channel. See vmbus_post_msg() and
vmbus_on_msg_dpc().

The first step is for the guest to connect to the generic
Hyper-V VMbus mechanism. As part of establishing this connection,
the guest and Hyper-V agree on a VMbus protocol version they will
use. This negotiation allows newer Linux kernels to run on older
Hyper-V versions, and vice versa.

The guest then tells Hyper-V to "send offers". Hyper-V sends an
offer message to the guest for each synthetic device that the VM
is configured to have. Each VMbus device type has a fixed GUID
known as the "class ID", and each VMbus device instance is also
identified by a GUID. The offer message from Hyper-V contains
both GUIDs to uniquely (within the VM) identify the device.
There is one offer message for each device instance, so a VM with
two synthetic NICs will get two offer messages with the NIC
class ID. The ordering of offer messages can vary from boot to
boot and must not be assumed to be consistent in Linux code. Offer
messages may also arrive long after Linux has initially booted
because Hyper-V supports adding devices, such as synthetic NICs,
to running VMs. A new offer message is processed by
vmbus_process_offer(), which indirectly invokes
vmbus_add_channel_work().

Upon receipt of an offer message, the guest identifies the device
type based on the class ID, and invokes the correct driver to set up
the device. Driver/device matching is performed using the standard
Linux mechanism.

The device driver probe function opens the primary VMbus channel to
the corresponding VSP. It allocates guest memory for the channel
ring buffers and shares that memory with the Hyper-V host by
giving the host a list of GPAs for the ring buffer memory. See
vmbus_establish_gpadl(). A minimal driver skeleton illustrating
this flow is sketched at the end of this section.

Once the ring buffer is set up, the device driver and VSP exchange
setup messages via the primary channel. These messages may include
negotiating the device protocol version to be used between the Linux
VSC and the VSP on the Hyper-V host. The setup messages may also
include creating additional VMbus channels, which are somewhat
mis-named as "sub-channels" since they are functionally
equivalent to the primary channel once they are created.

Finally, the device driver may create entries in /dev as with
any device driver.

The Hyper-V host can send a "rescind" message to the guest to
remove a device that was previously offered. Linux drivers must
handle such a rescind message at any time. Rescinding a device
invokes the device driver "remove" function to cleanly shut
down the device and remove it. Once a synthetic device is
rescinded, neither Hyper-V nor Linux retains any state about
its previous existence. Such a device might be re-added later,
in which case it is treated as an entirely new device. See
vmbus_onoffer_rescind().
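To tie the above together, here is a minimal sketch of a VSC driver
for a hypothetical device. The hv_driver structure, the HV_NIC_GUID
macro, vmbus_open()/vmbus_close(), and vmbus_driver_register()/
vmbus_driver_unregister() are VMbus APIs from include/linux/hyperv.h
(signatures as in recent kernels); the ring sizes and everything
named "example" are hypothetical::

  #include <linux/hyperv.h>
  #include <linux/module.h>

  /* A real driver matches its own device's class ID; the NIC
   * GUID is used here purely for illustration. */
  static const struct hv_vmbus_device_id example_id_table[] = {
          { HV_NIC_GUID, },
          { },
  };

  static void example_chan_callback(void *context)
  {
          /* Read packets from the "in" ring buffer and validate
           * them in a private copy, as described earlier. */
  }

  static int example_probe(struct hv_device *dev,
                           const struct hv_vmbus_device_id *id)
  {
          /* Opening the primary channel allocates the ring buffer
           * memory and establishes the GPADL with the host. The
           * 16 Kbyte ring sizes are arbitrary for this sketch. */
          return vmbus_open(dev->channel, 16 * 1024, 16 * 1024,
                            NULL, 0, example_chan_callback, dev);
  }

  static void example_remove(struct hv_device *dev)
  {
          /* Called on driver unload or when the host rescinds
           * the device. */
          vmbus_close(dev->channel);
  }

  static struct hv_driver example_drv = {
          .name = "example",
          .id_table = example_id_table,
          .probe = example_probe,
          .remove = example_remove,
  };

  static int __init example_init(void)
  {
          return vmbus_driver_register(&example_drv);
  }

  static void __exit example_exit(void)
  {
          vmbus_driver_unregister(&example_drv);
  }

  module_init(example_init);
  module_exit(example_exit);
  MODULE_LICENSE("GPL");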