.. SPDX-License-Identifier: GPL-2.0

============================================================
Linux kernel driver for Elastic Network Adapter (ENA) family
============================================================

Overview
========

ENA is a networking interface designed to make good use of modern CPU
features and system architectures.

The ENA device exposes a lightweight management interface with a
minimal set of memory-mapped registers and an extendible command set
through an Admin Queue.

The driver supports a range of ENA devices, is link-speed independent
(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
a negotiated and extendible feature set.

Some ENA devices support SR-IOV. This driver is used for both the
SR-IOV Physical Function (PF) and Virtual Function (VF) devices.

ENA devices enable high speed and low overhead network traffic
processing by providing multiple Tx/Rx queue pairs (the maximum number
is advertised by the device via the Admin Queue), a dedicated MSI-X
interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
and CPU cacheline optimized data placement.

The ENA driver supports industry standard TCP/IP offload features such as
checksum offload. Receive-side scaling (RSS) is supported for multi-core
scaling.

The ENA driver and its corresponding devices implement health
monitoring mechanisms such as a watchdog, enabling the device and driver
to recover in a manner transparent to the application, as well as
debug logs.

Some of the ENA devices support a working mode called Low-latency
Queue (LLQ), which saves several more microseconds.

ENA Source Code Directory Structure
===================================

=================   ======================================================
ena_com.[ch]        Management communication layer. This layer is
                    responsible for handling all the management
                    (admin) communication between the device and the
                    driver.
ena_eth_com.[ch]    Tx/Rx data path.
ena_admin_defs.h    Definition of ENA management interface.
ena_eth_io_defs.h   Definition of ENA data path interface.
ena_common_defs.h   Common definitions for ena_com layer.
ena_regs_defs.h     Definition of ENA PCI memory-mapped (MMIO) registers.
ena_netdev.[ch]     Main Linux kernel driver.
ena_ethtool.c       ethtool callbacks.
ena_xdp.[ch]        XDP files.
ena_pci_id_tbl.h    Supported device IDs.
=================   ======================================================

Management Interface
====================

The ENA management interface is exposed by means of:

- PCIe Configuration Space
- Device Registers
- Admin Queue (AQ) and Admin Completion Queue (ACQ)
- Asynchronous Event Notification Queue (AENQ)

ENA device MMIO Registers are accessed only during driver
initialization and are not used during further normal device
operation.

AQ is used for submitting management commands, and the
results/responses are reported asynchronously through ACQ.

ENA introduces a small set of management commands with room for
vendor-specific extensions. Most of the management operations are
framed in a generic Get/Set feature command.

The following admin queue commands are supported:

- Create I/O submission queue
- Create I/O completion queue
- Destroy I/O submission queue
- Destroy I/O completion queue
- Get feature
- Set feature
- Configure AENQ
- Get statistics

Refer to ena_admin_defs.h for the list of supported Get/Set Feature
properties.
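
The asynchronous AQ/ACQ exchange described above can be pictured with a
small Python model. This is only a sketch: the class ``ToyAdminQueue``, the
command names, the result dictionary and the in-memory "device" are all
hypothetical and do not mirror the structures in ena_com.[ch].

```python
from collections import deque


class ToyAdminQueue:
    """Toy model of an admin queue (AQ) paired with a completion queue (ACQ).

    Hypothetical illustration only: names and layout are not taken from
    the ENA driver sources.
    """

    def __init__(self, depth):
        self.depth = depth
        self.aq = deque()      # submitted commands awaiting the device
        self.acq = deque()     # completions posted by the device
        self.pending = {}      # command id -> command name
        self.next_id = 0

    def submit(self, name, **params):
        """Place a command on the AQ and return its command id."""
        if len(self.pending) >= self.depth:
            raise RuntimeError("admin queue is full")
        cmd_id = self.next_id
        self.next_id += 1
        self.pending[cmd_id] = name
        self.aq.append((cmd_id, name, params))
        return cmd_id

    def device_process(self):
        """Stand-in for the device: consume the AQ and post completions."""
        while self.aq:
            cmd_id, _name, params = self.aq.popleft()
            self.acq.append((cmd_id, {"status": "ok", "echo": params}))

    def poll_completions(self):
        """Drain the ACQ, matching each completion to its pending command."""
        done = []
        while self.acq:
            cmd_id, result = self.acq.popleft()
            name = self.pending.pop(cmd_id)
            done.append((name, result))
        return done
```

The point of the model is that ``submit()`` returns before any result
exists; results only become visible once the device side has posted them
and the driver side drains the completion queue.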

The Asynchronous Event Notification Queue (AENQ) is a uni-directional
queue used by the ENA device to send to the driver events that cannot
be reported using ACQ. AENQ events are subdivided into groups. Each
group may have multiple syndromes, as shown below.

The events are:

====================    ===============
Group                   Syndrome
====================    ===============
Link state change       **X**
Fatal error             **X**
Notification            Suspend traffic
Notification            Resume traffic
Keep-Alive              **X**
====================    ===============

ACQ and AENQ share the same MSI-X vector.

Keep-Alive is a special mechanism that allows monitoring the device's health.
A Keep-Alive event is delivered by the device every second.
The driver maintains a watchdog (WD) handler which logs the current state and
statistics. If the keep-alive events aren't delivered as expected the WD resets
the device and the driver.
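
A minimal Python sketch of the keep-alive watchdog idea follows. The text
above does not state how long the driver waits before declaring the device
unhealthy, so ``timeout_s`` is an assumed parameter, and the class and
method names are hypothetical.

```python
class KeepAliveWatchdog:
    """Toy keep-alive watchdog; the timeout value is an assumed parameter."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_keep_alive = 0.0
        self.resets = 0

    def on_keep_alive(self, now):
        """Record the arrival time of a Keep-Alive AENQ event."""
        self.last_keep_alive = now

    def check(self, now):
        """Periodic check: report True if a reset should be triggered."""
        if now - self.last_keep_alive > self.timeout_s:
            self.resets += 1
            self.last_keep_alive = now  # restart the window after the reset
            return True
        return False
```

In this model the device side calls ``on_keep_alive()`` roughly once per
second, and a periodic timer calls ``check()``; a ``True`` result stands in
for the device and driver reset described above.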

Data Path Interface
===================

I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
SQ, respectively). Each SQ has a completion queue (CQ) associated
with it.

The SQs and CQs are implemented as descriptor rings in contiguous
physical memory.

The ENA driver supports two Queue Operation modes for Tx SQs:

- **Regular mode:**
  In this mode the Tx SQs reside in the host's memory. The ENA
  device fetches the ENA Tx descriptors and packet data from host
  memory.

- **Low Latency Queue (LLQ) mode or "push-mode":**
  In this mode the driver pushes the transmit descriptors and the
  first 96 bytes of the packet directly to the ENA device memory
  space. The rest of the packet payload is fetched by the
  device. For this operation mode, the driver uses a dedicated PCI
  device memory BAR, which is mapped with write-combine capability.

  **Note that** not all ENA devices support LLQ, and this feature is negotiated
  with the device upon initialization. If the ENA device does not
  support LLQ mode, the driver falls back to the regular mode.
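
The difference between the two Tx modes can be sketched in Python as a
split of the packet bytes. The 96-byte figure comes from the description
above; the function name and the tuple it returns are hypothetical, and
descriptors are left out of the model entirely.

```python
LLQ_PUSH_HEADER_BYTES = 96  # figure quoted in the description above


def split_for_llq(packet: bytes, llq_supported: bool):
    """Return (pushed, fetched) for one packet.

    pushed:  bytes the driver writes directly to device memory
    fetched: bytes the device reads from host memory
    """
    if not llq_supported:
        # Regular mode: the device fetches everything from host memory.
        return b"", packet
    return packet[:LLQ_PUSH_HEADER_BYTES], packet[LLQ_PUSH_HEADER_BYTES:]
```

A packet shorter than 96 bytes ends up entirely in the pushed part in this
model, which is one way to see why small packets benefit most from LLQ.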

The Rx SQs support only the regular mode.

The driver supports multi-queue for both Tx and Rx. This has various
benefits:

- Reduced CPU/thread/process contention on a given Ethernet interface.
- Cache miss rate on completion is reduced, particularly for data
  cache lines that hold the sk_buff structures.
- Increased process-level parallelism when handling received packets.
- Increased data cache hit rate, by steering kernel processing of
  packets to the CPU where the application thread consuming the
  packet is running.
- In-hardware interrupt redirection.

Interrupt Modes
===============

The driver assigns a single MSI-X vector per queue pair (for both Tx
and Rx directions). The driver assigns an additional dedicated MSI-X vector
for management (for ACQ and AENQ).

Management interrupt registration is performed when the Linux kernel
probes the adapter, and it is de-registered when the adapter is
removed. I/O queue interrupt registration is performed when the Linux
interface of the adapter is opened, and it is de-registered when the
interface is closed.

The management interrupt is named::

   ena-mgmnt@pci:<PCI domain:bus:slot.function>

and for each queue pair, an interrupt is named::

   <interface name>-Tx-Rx-<queue index>
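
When inspecting interrupt names (for example, those listed in
``/proc/interrupts``), the two naming patterns above can be told apart with
a small helper. The following Python sketch is illustrative only; the
function name and the tuples it returns are hypothetical.

```python
import re

# Patterns follow the two name formats given above.
MGMNT_RE = re.compile(r"^ena-mgmnt@pci:(?P<pci>.+)$")
IO_RE = re.compile(r"^(?P<ifname>.+)-Tx-Rx-(?P<queue>\d+)$")


def classify_ena_irq(name):
    """Classify an interrupt name as management, I/O queue, or neither."""
    m = MGMNT_RE.match(name)
    if m:
        return ("mgmnt", m.group("pci"))
    m = IO_RE.match(name)
    if m:
        return ("io", m.group("ifname"), int(m.group("queue")))
    return None
```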

The ENA device operates in auto-mask and auto-clear interrupt
modes. That is, once an MSI-X interrupt is delivered to the host, its
Cause bit is automatically cleared and the interrupt is masked. The
interrupt is unmasked by the driver after NAPI processing is complete.

Interrupt Moderation
====================

The ENA driver and device can operate in conventional or adaptive interrupt
moderation mode.

**In conventional mode** the driver instructs the device to postpone interrupt
posting according to a static interrupt delay value. The interrupt delay
value can be configured through `ethtool(8)`. The following `ethtool`
parameters are supported by the driver: ``tx-usecs``, ``rx-usecs``.

**In adaptive interrupt moderation mode** the interrupt delay value is
updated by the driver dynamically and adjusted every NAPI cycle
according to the traffic nature.

Adaptive coalescing can be switched on/off through `ethtool(8)`'s
:code:`adaptive-rx on|off` parameter.

More information about Dynamic Interrupt Moderation (DIM) can be found in
Documentation/networking/net_dim.rst

.. _`RX copybreak`:

RX copybreak
============

The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
and can be configured by the ETHTOOL_STUNABLE command of the
SIOCETHTOOL ioctl.

Statistics
==========

The user can obtain ENA device and driver statistics using `ethtool`.
The driver can collect regular or extended statistics (including
per-queue stats) from the device.

In addition, the driver logs the stats to syslog upon device reset.

MTU
===

The driver supports an arbitrarily large MTU with a maximum that is
negotiated with the device. The driver configures MTU using the
SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
via `ip(8)` and similar tools.

Stateless Offloads
==================

The ENA driver supports:

- IPv4 header checksum offload
- TCP/UDP over IPv4/IPv6 checksum offloads

RSS
===

- The ENA device supports RSS that allows flexible Rx traffic
  steering.
- Toeplitz and CRC32 hash functions are supported.
- Different combinations of L2/L3/L4 fields can be configured as
  inputs for hash functions.
- The driver configures RSS settings using the AQ SetFeature command
  (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
  ENA_ADMIN_RSS_INDIRECTION_TABLE_CONFIG properties).
- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash
  function delivered in the Rx CQ descriptor is set in the received
  SKB.
- The user can provide a hash key, hash function, and configure the
  indirection table through `ethtool(8)`.
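
To make the roles of the hash key, hash function and indirection table
concrete, here is a Python sketch of a generic Toeplitz hash followed by an
indirection-table lookup. It is not taken from the driver: on ENA the hash
is computed by the device, and the input field order, the key, the table
size and the choice of indexing the table with ``hash % table_size`` are
all assumptions made for illustration.

```python
def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Generic 32-bit Toeplitz hash of data under key.

    For every set bit i of the input (most significant bit first), XOR in
    the 32-bit window of the key that starts at bit i.
    """
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least len(data) + 4 bytes long")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            window = (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
            result ^= window
    return result


def select_rx_queue(hash_value: int, indirection_table) -> int:
    """Map a hash to an Rx queue (assumed convention: hash modulo table size)."""
    return indirection_table[hash_value % len(indirection_table)]
```

Changing the key changes which flows collide, while changing the
indirection table changes how hash values are spread over queues without
touching the hash itself.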

DATA PATH
=========

Tx
--

:code:`ena_start_xmit()` is called by the stack. This function does the following:

- Maps data buffers (``skb->data`` and frags).
- Populates ``ena_buf`` for the push buffer (if the driver and device are
  in push mode).
- Prepares ENA bufs for the remaining frags.
- Allocates a new request ID from the empty ``req_id`` ring. The request
  ID is the index of the packet in the Tx info. This is used for
  out-of-order Tx completions.
- Adds the packet to the proper place in the Tx ring.
- Calls :code:`ena_com_prepare_tx()`, an ENA communication layer function that
  converts the ``ena_bufs`` to ENA descriptors (and adds meta ENA descriptors
  as needed).

  * This function also copies the ENA descriptors and the push buffer
    to the device memory space (if in push mode).

- Writes a doorbell to the ENA device.
- When the ENA device finishes sending the packet, a completion
  interrupt is raised.
- The interrupt handler schedules NAPI.
- The :code:`ena_clean_tx_irq()` function is called. This function handles the
  completion descriptors generated by the ENA, with a single
  completion descriptor per completed packet.

  * ``req_id`` is retrieved from the completion descriptor. The ``tx_info`` of
    the packet is retrieved via the ``req_id``. The data buffers are
    unmapped and ``req_id`` is returned to the empty ``req_id`` ring.
  * The function stops when the completion descriptors are completed or
    the budget is reached.
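
The ``req_id`` bookkeeping in the steps above is what makes out-of-order
completions workable. The following Python sketch models only that part;
the class ``TxRing`` and its methods are hypothetical, and DMA mapping,
descriptors and doorbells are left out.

```python
from collections import deque


class TxRing:
    """Toy model of req_id allocation and out-of-order Tx completion."""

    def __init__(self, size):
        self.free_req_ids = deque(range(size))  # the "empty req_id ring"
        self.tx_info = [None] * size            # per-packet bookkeeping

    def xmit(self, skb):
        """Take a free req_id for the packet, or None if the ring is full."""
        if not self.free_req_ids:
            return None
        req_id = self.free_req_ids.popleft()
        self.tx_info[req_id] = skb
        return req_id

    def clean(self, completed_req_ids, budget):
        """Handle up to budget completions, in whatever order they arrive."""
        done = []
        for req_id in completed_req_ids[:budget]:
            done.append(self.tx_info[req_id])
            self.tx_info[req_id] = None
            self.free_req_ids.append(req_id)  # return the id to the free ring
        return done
```

Because each completion carries its own ``req_id``, the cleanup path never
has to assume that packets complete in the order they were submitted.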

Rx
--

- When a packet is received from the ENA device, the interrupt handler
  schedules NAPI.
- The :code:`ena_clean_rx_irq()` function is called. This function calls
  :code:`ena_com_rx_pkt()`, an ENA communication layer function, which returns the
  number of descriptors used for a new packet, and zero if
  no new packet is found.
- :code:`ena_rx_skb()` checks packet length:

  * If the packet is small (len < rx_copybreak), the driver allocates
    an SKB for the new packet, and copies the packet payload into the
    SKB data buffer.

    - In this way the original data buffer is not passed to the stack
      and is reused for future Rx packets.

  * Otherwise the function unmaps the Rx buffer, sets the first
    descriptor as `skb`'s linear part and the other descriptors as the
    `skb`'s frags.

- The new SKB is updated with the necessary information (protocol,
  checksum hw verify result, etc.), and then passed to the network
  stack, using the NAPI interface function :code:`napi_gro_receive()`.
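
The copybreak decision described in the steps above can be sketched as
follows. This Python model is illustrative only: the function name and the
dictionary it returns are hypothetical, and each Rx descriptor is
represented simply by the bytes it holds.

```python
def build_skb(descriptors, rx_copybreak):
    """Model of the copy-versus-attach decision for one received packet.

    descriptors: list of bytes objects, one per Rx descriptor of the packet.
    """
    length = sum(len(d) for d in descriptors)
    if length < rx_copybreak:
        # Small packet: copy the payload, keep the Rx buffers for reuse.
        return {
            "linear": b"".join(descriptors),
            "frags": [],
            "buffers_reused": True,
        }
    # Otherwise: first descriptor is the linear part, the rest are frags.
    return {
        "linear": descriptors[0],
        "frags": list(descriptors[1:]),
        "buffers_reused": False,
    }
```

The trade-off is a small copy for short packets in exchange for not having
to unmap and replace the original Rx buffer.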

Dynamic RX Buffers (DRB)
------------------------

Each RX descriptor in the RX ring is a single memory page (which is either 4KB
or 16KB long depending on the system's configuration).
To reduce the memory allocations required when dealing with a high rate of small
packets, the driver tries to reuse the remaining RX descriptor's space if more
than 2KB of this page remain unused.

A simple example of this mechanism is the following sequence of events:

::

        1. Driver allocates page-sized RX buffer and passes it to hardware
                +----------------------+
                |4KB RX Buffer         |
                +----------------------+

        2. A 300-byte packet is received on this buffer

        3. The driver increases the ref count on this page and returns it back to
           HW as an RX buffer of size 4KB - 300 bytes = 3796 bytes
               +----+--------------------+
               |****|3796 Bytes RX Buffer|
               +----+--------------------+

This mechanism isn't used when an XDP program is loaded, or when the
RX packet is less than rx_copybreak bytes (in which case the packet is
copied out of the RX buffer into the linear part of a new skb allocated
for it and the RX buffer remains the same size, see `RX copybreak`_).
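
The reuse rule above reduces to a short calculation, sketched here in
Python. The function name is hypothetical, and the model ignores any
alignment or headroom adjustments a real driver may apply; it only encodes
the "more than 2KB remaining" condition and the two exceptions stated
above.

```python
REUSE_THRESHOLD = 2048  # reuse only if more than 2KB of the page remain


def reuse_rx_page(buf_len, pkt_len, xdp_loaded=False, rx_copybreak=0):
    """Return the size of the buffer carved from the same page, or None.

    None means the page is not split: either the mechanism is bypassed
    (XDP loaded, or the packet is below rx_copybreak) or too little
    space remains.
    """
    if xdp_loaded or pkt_len < rx_copybreak:
        return None
    remaining = buf_len - pkt_len
    if remaining > REUSE_THRESHOLD:
        return remaining
    return None
```

With a 4096-byte buffer and a 300-byte packet the function returns 3796,
matching the example sequence above.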