.. SPDX-License-Identifier: GPL-2.0

============================================================
Linux kernel driver for Elastic Network Adapter (ENA) family
============================================================

Overview
========

ENA is a networking interface designed to make good use of modern CPU
features and system architectures.

The ENA device exposes a lightweight management interface with a
minimal set of memory-mapped registers and an extensible command set
through an Admin Queue.

The driver supports a range of ENA devices, is link-speed independent
(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
a negotiated and extensible feature set.

Some ENA devices support SR-IOV. This driver is used for both the
SR-IOV Physical Function (PF) and Virtual Function (VF) devices.

ENA devices enable high speed and low overhead network traffic
processing by providing multiple Tx/Rx queue pairs (the maximum number
is advertised by the device via the Admin Queue), a dedicated MSI-X
interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
and CPU cacheline optimized data placement.

The ENA driver supports industry standard TCP/IP offload features such as
checksum offload. Receive-side scaling (RSS) is supported for multi-core
scaling.

The ENA driver and its corresponding devices implement health
monitoring mechanisms such as watchdog, enabling the device and driver
to recover in a manner transparent to the application, as well as
debug logs.

Some of the ENA devices support a working mode called Low-latency
Queue (LLQ), which reduces Tx latency by several microseconds.

ENA Source Code Directory Structure
===================================

=================   ======================================================
ena_com.[ch]        Management communication layer. This layer is
                    responsible for handling all the management
                    (admin) communication between the device and the
                    driver.
ena_eth_com.[ch]    Tx/Rx data path.
ena_admin_defs.h    Definition of ENA management interface.
ena_eth_io_defs.h   Definition of ENA data path interface.
ena_common_defs.h   Common definitions for ena_com layer.
ena_regs_defs.h     Definition of ENA PCI memory-mapped (MMIO) registers.
ena_netdev.[ch]     Main Linux kernel driver.
ena_ethtool.c       ethtool callbacks.
ena_xdp.[ch]        XDP files.
ena_pci_id_tbl.h    Supported device IDs.
=================   ======================================================

Management Interface:
=====================

ENA management interface is exposed by means of:

- PCIe Configuration Space
- Device Registers
- Admin Queue (AQ) and Admin Completion Queue (ACQ)
- Asynchronous Event Notification Queue (AENQ)

ENA device MMIO Registers are accessed only during driver
initialization and are not used during further normal device
operation.

AQ is used for submitting management commands, and the
results/responses are reported asynchronously through ACQ.

ENA introduces a small set of management commands with room for
vendor-specific extensions. Most of the management operations are
framed in a generic Get/Set feature command.

The following admin queue commands are supported:

- Create I/O submission queue
- Create I/O completion queue
- Destroy I/O submission queue
- Destroy I/O completion queue
- Get feature
- Set feature
- Configure AENQ
- Get statistics

Refer to ena_admin_defs.h for the list of supported Get/Set Feature
properties.
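
The submit/complete flow between AQ and ACQ can be illustrated with a
small user-space model. This is only a sketch under stated assumptions:
every name below (``toy_admin``, ``toy_aq_submit()``, the ring depth, the
feature table) is hypothetical and does not correspond to the structures
in ena_com.[ch] or ena_admin_defs.h. It shows a command being queued, the
"device" processing it, and the completion being matched back to the
command through its command ID.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_AQ_DEPTH     4u /* hypothetical ring depth */
#define TOY_NUM_FEATURES 8u /* hypothetical number of feature slots */

enum toy_opcode { TOY_GET_FEATURE, TOY_SET_FEATURE };

struct toy_cmd {
	uint16_t command_id;
	enum toy_opcode opcode;
	uint32_t feature_id;
	uint32_t value;
};

struct toy_comp {
	uint16_t command_id;
	int status;     /* 0 on success, -1 for an unknown feature */
	uint32_t value; /* filled in for Get feature */
};

/* Zero-initialize before use. Counters are free-running and wrap. */
struct toy_admin {
	struct toy_cmd sq[TOY_AQ_DEPTH];  /* admin queue (AQ) */
	struct toy_comp cq[TOY_AQ_DEPTH]; /* admin completion queue (ACQ) */
	uint16_t sq_tail, sq_head;
	uint16_t cq_tail, cq_head;
	uint16_t next_id;
	uint32_t features[TOY_NUM_FEATURES]; /* device-side state */
};

/* Driver side: queue a command. Returns its command ID, or -1 if full. */
int toy_aq_submit(struct toy_admin *a, enum toy_opcode op,
		  uint32_t feature_id, uint32_t value)
{
	struct toy_cmd *cmd;

	/* Count commands whose completions have not been reaped yet. */
	if ((uint16_t)(a->sq_tail - a->cq_head) >= TOY_AQ_DEPTH)
		return -1;

	cmd = &a->sq[a->sq_tail % TOY_AQ_DEPTH];
	cmd->command_id = a->next_id++;
	cmd->opcode = op;
	cmd->feature_id = feature_id;
	cmd->value = value;
	a->sq_tail++;
	return cmd->command_id;
}

/* Device side: consume pending commands and post one completion each. */
void toy_device_process(struct toy_admin *a)
{
	while (a->sq_head != a->sq_tail) {
		const struct toy_cmd *cmd = &a->sq[a->sq_head % TOY_AQ_DEPTH];
		struct toy_comp *comp = &a->cq[a->cq_tail % TOY_AQ_DEPTH];

		comp->command_id = cmd->command_id;
		comp->value = 0;
		if (cmd->feature_id >= TOY_NUM_FEATURES) {
			comp->status = -1;
		} else if (cmd->opcode == TOY_SET_FEATURE) {
			a->features[cmd->feature_id] = cmd->value;
			comp->status = 0;
		} else {
			comp->value = a->features[cmd->feature_id];
			comp->status = 0;
		}
		a->sq_head++;
		a->cq_tail++;
	}
}

/* Driver side: reap the next completion, if any. */
bool toy_acq_poll(struct toy_admin *a, struct toy_comp *out)
{
	if (a->cq_head == a->cq_tail)
		return false;
	*out = a->cq[a->cq_head % TOY_AQ_DEPTH];
	a->cq_head++;
	return true;
}
```

In this model a Set feature followed by a Get feature on the same
feature slot returns the value that was just written, and each
completion carries the command ID of the command it answers.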

The Asynchronous Event Notification Queue (AENQ) is a uni-directional
queue used by the ENA device to send to the driver events that cannot
be reported using ACQ. AENQ events are subdivided into groups. Each
group may have multiple syndromes, as shown below.

The events are:

====================    ===============
Group                   Syndrome
====================    ===============
Link state change       **X**
Fatal error             **X**
Notification            Suspend traffic
Notification            Resume traffic
Keep-Alive              **X**
====================    ===============

ACQ and AENQ share the same MSI-X vector.

Keep-Alive is a special mechanism that allows monitoring the device's health.
A Keep-Alive event is delivered by the device every second.
The driver maintains a watchdog (WD) handler which logs the current state and
statistics. If the keep-alive events aren't delivered as expected, the WD resets
the device and the driver.

Data Path Interface
===================

I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
SQ respectively). Each SQ has a completion queue (CQ) associated
with it.

The SQs and CQs are implemented as descriptor rings in contiguous
physical memory.

The ENA driver supports two Queue Operation modes for Tx SQs:

- **Regular mode:**
  In this mode the Tx SQs reside in the host's memory. The ENA
  device fetches the ENA Tx descriptors and packet data from host
  memory.

- **Low Latency Queue (LLQ) mode or "push-mode":**
  In this mode the driver pushes the transmit descriptors and the
  first 96 bytes of the packet directly to the ENA device memory
  space. The rest of the packet payload is fetched by the
  device. For this operation mode, the driver uses a dedicated PCI
  device memory BAR, which is mapped with write-combine capability.

  **Note that** not all ENA devices support LLQ, and this feature is negotiated
  with the device upon initialization. If the ENA device does not
  support LLQ mode, the driver falls back to the regular mode.

The Rx SQs support only the regular mode.

The driver supports multi-queue for both Tx and Rx. This has various
benefits:

- Reduced CPU/thread/process contention on a given Ethernet interface.
- Cache miss rate on completion is reduced, particularly for data
  cache lines that hold the sk_buff structures.
- Increased process-level parallelism when handling received packets.
- Increased data cache hit rate, by steering kernel processing of
  packets to the CPU, where the application thread consuming the
  packet is running.
- In hardware interrupt re-direction.

Interrupt Modes
===============

The driver assigns a single MSI-X vector per queue pair (for both Tx
and Rx directions). The driver assigns an additional dedicated MSI-X vector
for management (for ACQ and AENQ).

Management interrupt registration is performed when the Linux kernel
probes the adapter, and it is de-registered when the adapter is
removed. I/O queue interrupt registration is performed when the Linux
interface of the adapter is opened, and it is de-registered when the
interface is closed.

The management interrupt is named::

   ena-mgmnt@pci:<PCI domain:bus:slot.function>

and for each queue pair, an interrupt is named::

   <interface name>-Tx-Rx-<queue index>

The ENA device operates in auto-mask and auto-clear interrupt
modes. That is, once MSI-X is delivered to the host, its Cause bit is
automatically cleared and the interrupt is masked. The interrupt is
unmasked by the driver after NAPI processing is complete.

Interrupt Moderation
====================

ENA driver and device can operate in conventional or adaptive interrupt
moderation mode.

**In conventional mode** the driver instructs the device to postpone interrupt
posting according to a static interrupt delay value. The interrupt delay
value can be configured through `ethtool(8)`. The following `ethtool`
parameters are supported by the driver: ``tx-usecs``, ``rx-usecs``.

**In adaptive interrupt** moderation mode the interrupt delay value is
updated by the driver dynamically and adjusted every NAPI cycle
according to the traffic nature.

Adaptive coalescing can be switched on/off through `ethtool(8)`'s
:code:`adaptive-rx on|off` parameter.

More information about Dynamic Interrupt Moderation (DIM) can be found in
Documentation/networking/net_dim.rst

.. _`RX copybreak`:

RX copybreak
============

The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
and can be configured by the ETHTOOL_STUNABLE command of the
SIOCETHTOOL ioctl.

Statistics
==========

The user can obtain ENA device and driver statistics using `ethtool`.
The driver can collect regular or extended statistics (including
per-queue stats) from the device.

In addition, the driver logs the stats to syslog upon device reset.

MTU
===

The driver supports an arbitrarily large MTU with a maximum that is
negotiated with the device. The driver configures MTU using the
SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
via `ip(8)` and similar tools.

Stateless Offloads
==================

The ENA driver supports:

- IPv4 header checksum offload
- TCP/UDP over IPv4/IPv6 checksum offloads

RSS
===

- The ENA device supports RSS that allows flexible Rx traffic
  steering.
- Toeplitz and CRC32 hash functions are supported.
- Different combinations of L2/L3/L4 fields can be configured as
  inputs for hash functions.
- The driver configures RSS settings using the AQ SetFeature command
  (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
  ENA_ADMIN_RSS_INDIRECTION_TABLE_CONFIG properties).
- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash
  function delivered in the Rx CQ descriptor is set in the received
  SKB.
- The user can provide a hash key, hash function, and configure the
  indirection table through `ethtool(8)`.

DATA PATH
=========

Tx
--

:code:`ena_start_xmit()` is called by the stack. This function does the following:

- Maps data buffers (``skb->data`` and frags).
- Populates ``ena_buf`` for the push buffer (if the driver and device are
  in push mode).
- Prepares ENA bufs for the remaining frags.
- Allocates a new request ID from the empty ``req_id`` ring. The request
  ID is the index of the packet in the Tx info. This is used for
  out-of-order Tx completions.
- Adds the packet to the proper place in the Tx ring.
- Calls :code:`ena_com_prepare_tx()`, an ENA communication layer function that
  converts the ``ena_bufs`` to ENA descriptors (and adds meta ENA descriptors as
  needed).

  * This function also copies the ENA descriptors and the push buffer
    to the Device memory space (if in push mode).

- Writes a doorbell to the ENA device.
- When the ENA device finishes sending the packet, a completion
  interrupt is raised.
- The interrupt handler schedules NAPI.
- The :code:`ena_clean_tx_irq()` function is called. This function handles the
  completion descriptors generated by the ENA, with a single
  completion descriptor per completed packet.

  * ``req_id`` is retrieved from the completion descriptor. The ``tx_info`` of
    the packet is retrieved via the ``req_id``. The data buffers are
    unmapped and ``req_id`` is returned to the empty ``req_id`` ring.
  * The function stops when the completion descriptors are completed or
    the budget is reached.

Rx
--

- A packet is received from the ENA device and an interrupt is raised.
- The interrupt handler schedules NAPI.
- The :code:`ena_clean_rx_irq()` function is called. This function calls
  :code:`ena_com_rx_pkt()`, an ENA communication layer function, which returns the
  number of descriptors used for a new packet, and zero if
  no new packet is found.
- :code:`ena_rx_skb()` checks packet length:

  * If the packet is small (len < rx_copybreak), the driver allocates
    a SKB for the new packet, and copies the packet payload into the
    SKB data buffer.

    - In this way the original data buffer is not passed to the stack
      and is reused for future Rx packets.

  * Otherwise the function unmaps the Rx buffer, sets the first
    descriptor as `skb`'s linear part and the other descriptors as the
    `skb`'s frags.

- The new SKB is updated with the necessary information (protocol,
  checksum hw verify result, etc.), and then passed to the network
  stack, using the NAPI interface function :code:`napi_gro_receive()`.

Dynamic RX Buffers (DRB)
------------------------

Each RX descriptor in the RX ring is a single memory page (which is either 4KB
or 16KB long depending on the system's configuration).
To reduce the memory allocations required when dealing with a high rate of small
packets, the driver tries to reuse the remaining RX descriptor's space if more
than 2KB of this page remain unused.

A simple example of this mechanism is the following sequence of events:

::

        1. Driver allocates page-sized RX buffer and passes it to hardware
                +----------------------+
                |4KB RX Buffer         |
                +----------------------+

        2. A 300Bytes packet is received on this buffer

        3. The driver increases the ref count on this page and returns it back to
           HW as an RX buffer of size 4KB - 300Bytes = 3796 Bytes
               +----+--------------------+
               |****|3796 Bytes RX Buffer|
               +----+--------------------+

This mechanism isn't used when an XDP program is loaded, or when the
RX packet is less than rx_copybreak bytes (in which case the packet is
copied out of the RX buffer into the linear part of a new skb allocated
for it and the RX buffer remains the same size, see `RX copybreak`_).
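
The reuse decision described in this section can be expressed as a short
predicate. The sketch below is a simplification under stated
assumptions: the helper names are hypothetical, and it deliberately
ignores details a real driver has to handle (headroom, alignment, page
reference counting). It only encodes the three conditions given in the
text: no XDP program loaded, packet not handled by the copybreak path, and
more than 2KB left in the buffer.

```c
#include <stdbool.h>
#include <stddef.h>

/* "More than 2KB" threshold taken from the text above. */
#define TOY_DRB_MIN_REUSE_BYTES 2048

/* Bytes left in an RX buffer after a packet of pkt_len bytes lands in it. */
size_t toy_drb_remaining(size_t buf_len, size_t pkt_len)
{
	return pkt_len < buf_len ? buf_len - pkt_len : 0;
}

/* Decide whether the rest of the page may be handed back to the device. */
bool toy_drb_can_reuse(size_t buf_len, size_t pkt_len,
		       size_t rx_copybreak, bool xdp_loaded)
{
	if (xdp_loaded)
		return false; /* DRB is not used with XDP */
	if (pkt_len < rx_copybreak)
		return false; /* packet is copied out; buffer keeps its size */
	return toy_drb_remaining(buf_len, pkt_len) > TOY_DRB_MIN_REUSE_BYTES;
}
```

With the numbers from the diagram, ``toy_drb_remaining(4096, 300)`` yields
3796, which is above the 2KB threshold, so the buffer is a candidate for
reuse as long as 300 is not below the configured rx_copybreak.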