.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering

RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers -- for example, a 4-tuple hash over
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by masking out the low order seven bits of the computed hash for the
packet (usually a Toeplitz hash), taking this number as a key into the
indirection table and reading the corresponding value.
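As a sketch of the lookup described above (illustrative only, not driver
code; the table contents are hypothetical), resolving the receive queue
from a computed hash with a 128-entry indirection table looks like:

```python
# Sketch: resolve an RSS receive queue from a flow hash using a
# 128-entry indirection table (table contents are hypothetical).

def rss_queue(flow_hash: int, indirection_table: list[int]) -> int:
    """Take the low-order 7 bits of the hash and index the table."""
    index = flow_hash & 0x7F           # low 7 bits -> 0..127
    return indirection_table[index]

# Default-style table: four queues distributed evenly over 128 entries.
table = [i % 4 for i in range(128)]

print(rss_queue(0x12345678, table))    # entry 0x78 -> queue 120 % 4 = 0
```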

Some advanced NICs allow steering packets to queues based on
programmable filters. For example, webserver bound TCP port 80 packets
can be directed to their own receive queue. Such “n-tuple” filters can
be configured from ethtool (--config-ntuple).


RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.
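To illustrate the weighting use case, here is a hypothetical sketch (not
the ethtool implementation) of building a 128-entry indirection table in
which queues receive unequal shares of entries:

```python
# Sketch: build an indirection table giving queues unequal relative
# weights, similar in effect to "ethtool --set-rxfh-indir weight 2 1 1"
# (the weights here are hypothetical).

def weighted_indir_table(weights: list[int], size: int = 128) -> list[int]:
    total = sum(weights)
    table = []
    for entry in range(size):
        # Map each table slot proportionally onto the weighted queues.
        target = entry * total // size
        acc = 0
        for queue, w in enumerate(weights):
            acc += w
            if target < acc:
                table.append(queue)
                break
    return table

table = weighted_indir_table([2, 1, 1])
# Queue 0 owns half of the 128 entries; queues 1 and 2 a quarter each.
print(table.count(0), table.count(1), table.count(2))   # 64 32 32
```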


RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt see Documentation/core-api/irq/irq-affinity.rst.
Some systems will be running irqbalance, a daemon that dynamically
optimizes IRQ assignments and as a result may override any manual settings.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-cpu load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.


RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler,
when a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the queue that should process a packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.
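The per-packet selection described above can be sketched as follows (a
simplified model, not kernel code; the eligible CPU list stands in for a
hypothetical rps_cpus configuration):

```python
# Sketch: RPS target-CPU selection. A receive queue has a configured
# list of eligible CPUs (derived from rps_cpus); the flow hash modulo
# the list size picks one, so every packet of a flow lands on the same
# CPU and no reordering is introduced.

def rps_target_cpu(flow_hash: int, rps_cpus: list[int]) -> int:
    return rps_cpus[flow_hash % len(rps_cpus)]

eligible = [0, 2, 4, 6]                # hypothetical rps_cpus for rx-0
print(rps_target_cpu(0xdeadbeef, eligible))   # same hash -> same CPU
print(rps_target_cpu(0xdeadbeef, eligible))
```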


RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are
assigned to the bitmap.
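As a sketch of how such a bitmap value is composed (the CPU numbers are
only an example), each set bit n in the mask allows RPS to steer packets
to CPU n, and the hex string is what would be written to the rps_cpus
file:

```python
# Sketch: compose the hex CPU-bitmap value for rps_cpus.
# Bit n set  ->  CPU n may receive steered packets.

def cpu_bitmap(cpus: list[int]) -> str:
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

print(cpu_bitmap([0, 1, 2, 3]))        # "f": CPUs 0-3
print(cpu_bitmap([2, 3]))              # "c": CPUs 2 and 3 only
```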


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since it already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.


RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off of sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or spoofed
source Denial of Service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.
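The drop decision above can be sketched as follows (a simplified model
using the documented defaults, not the kernel implementation; the
counters are hypothetical inputs):

```python
# Sketch of the flow-limit drop decision: drop a flow's new packet only
# when the backlog exceeds half of netdev_max_backlog AND the flow
# accounts for more than half of the last 256 packets.

HISTORY_LEN = 256           # per-CPU history window
FLOW_RATIO = HISTORY_LEN // 2   # default ratio: half the window

def flow_limit_drop(queue_len: int, max_backlog: int,
                    flow_recent_count: int) -> bool:
    if queue_len <= max_backlog // 2:
        return False                   # below threshold: never drop here
    return flow_recent_count > FLOW_RATIO

print(flow_limit_drop(400, 1000, 200))  # False: queue below half
print(flow_limit_drop(600, 1000, 200))  # True: large flow, queue high
print(flow_limit_drop(600, 1000, 50))   # False: small flow survives
```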


Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl::

  net.core.flow_limit_table_len

The value is only consulted when a new table is allocated. Modifying
it does not update active tables.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature requires the input packet queue length to exceed
the flow limit threshold (50%) + the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.


RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. That is the role of Receive Flow Steering
(RFS). The goal of RFS is to increase datacache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.
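The selection rules above can be sketched as follows (a simplified model
of the decision, not kernel code; nr_cpu_ids, the counters and the
online set are hypothetical inputs):

```python
# Sketch of the RFS out-of-order-avoidance check: migrate a flow to the
# desired CPU only when the current CPU's backlog has drained past the
# flow's recorded tail, or the current CPU is unset/offline.

NR_CPU_IDS = 8  # hypothetical

def choose_cpu(desired_cpu: int, current_cpu: int,
               head_counter: int, recorded_tail: int,
               online: set[int]) -> int:
    if desired_cpu == current_cpu:
        return current_cpu
    if (current_cpu >= NR_CPU_IDS               # current CPU unset
            or current_cpu not in online        # current CPU offline
            or head_counter >= recorded_tail):  # backlog drained
        return desired_cpu                      # safe to migrate the flow
    return current_cpu                          # packets outstanding: stay

online = {0, 1, 2, 3}
# Packets still outstanding on CPU 1 (head 90 < tail 100): stay on 1.
print(choose_cpu(2, 1, head_counter=90, recorded_tail=100, online=online))
# Backlog drained (head 100 >= tail 100): move to desired CPU 2.
print(choose_cpu(2, 1, head_counter=100, recorded_tail=100, online=online))
```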


RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. The number of entries in the global flow table is set through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.
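The sizing example above works out as follows; the power-of-two rounding
mirrors the behavior described, applied here to a hypothetical
non-power-of-two value:

```python
# Worked example of the suggested RFS sizing: per-queue flow count is
# the global entry count divided by the number of queues, and values
# are rounded up to the nearest power of two.

def round_up_pow2(n: int) -> int:
    return 1 << (n - 1).bit_length()

rps_sock_flow_entries = 32768
n_queues = 16
rps_flow_cnt = rps_sock_flow_entries // n_queues
print(rps_flow_cnt)                 # 2048, already a power of two
print(round_up_pow2(3000))          # 4096: what 3000 would round up to
```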


Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.


Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.


XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This enables sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the same
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low. Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during the busy poll, transmit completion
may be processed along with it in the same thread context and so result in
reduced latency.
435e6e37f63SOtto Sabart
436e6e37f63SOtto SabartXPS is configured per transmit queue by setting a bitmap of
437e6e37f63SOtto SabartCPUs/receive-queues that may use that queue to transmit. The reverse
438e6e37f63SOtto Sabartmapping, from CPUs to transmit queues or from receive-queues to transmit
439e6e37f63SOtto Sabartqueues, is computed and maintained for each network device. When
440e6e37f63SOtto Sabarttransmitting the first packet in a flow, the function get_xps_queue() is
441e6e37f63SOtto Sabartcalled to select a queue. This function uses the ID of the receive queue
442e6e37f63SOtto Sabartfor the socket connection for a match in the receive queue-to-transmit queue
443e6e37f63SOtto Sabartlookup table. Alternatively, this function can also use the ID of the
444e6e37f63SOtto Sabartrunning CPU as a key into the CPU-to-queue lookup table. If the
445e6e37f63SOtto SabartID matches a single queue, that is used for transmission. If multiple
446e6e37f63SOtto Sabartqueues match, one is selected by using the flow hash to compute an index
447e6e37f63SOtto Sabartinto the set. When selecting the transmit queue based on receive queue(s)
448e6e37f63SOtto Sabartmap, the transmit device is not validated against the receive device as it
449e6e37f63SOtto Sabartrequires expensive lookup operation in the datapath.

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out of order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.
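
A minimal sketch of this queue-reuse rule (illustrative Python with invented
names; in the kernel the queue is recorded in the socket structure and the
flag is skb->ooo_okay):

```python
# Sketch: the chosen queue is recorded per flow and only re-selected when
# ooo_okay indicates there are no outstanding packets in the flow.
# "FlowSocket" and its field names are illustrative, not kernel types.

class FlowSocket:
    def __init__(self):
        self.tx_queue = None  # transmit queue recorded for this flow

def pick_queue(sock, preferred_queue, ooo_okay):
    """Reuse the recorded queue unless a change cannot reorder packets."""
    if sock.tx_queue is None or ooo_okay:
        sock.tx_queue = preferred_queue  # first packet, or safe to move
    return sock.tx_queue
```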

XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init. The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs:

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
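
A small helper can build the hex string written to xps_cpus. The
comma-separated 32-bit word layout used below (most significant word first,
later words zero-padded to 8 digits) is an assumption modelled on the
kernel's cpumask formatting; verify against your kernel before relying on it:

```python
# Build the hex bitmap string for xps_cpus from a set of CPU IDs.
# The comma-separated 32-bit word layout is an assumption based on the
# kernel's cpumask formatting, not taken from this document.

def cpus_to_bitmap(cpus, num_possible_cpus=64):
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # set one bit per CPU allowed to use the queue
    nwords = max(1, (num_possible_cpus + 31) // 32)
    words = []
    for i in range(nwords - 1, -1, -1):
        word = (mask >> (32 * i)) & 0xFFFFFFFF
        # most significant word unpadded, the rest zero-padded to 8 digits
        words.append(format(word, "x") if i == nwords - 1 else format(word, "08x"))
    return ",".join(words)
```

For example, ``cpus_to_bitmap([0, 1, 2, 3], 32)`` yields ``"f"``, which could
then be written to a queue's xps_cpus file (device and queue names are up to
the system at hand).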


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).
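
One way to sketch such an assignment when there are fewer queues than CPUs is
to map contiguous blocks of CPU IDs onto successive queues. Contiguous
numbering is only a crude stand-in for "CPUs that share a cache"; real cache
topology should be read from /sys/devices/system/cpu, not inferred from CPU
numbers:

```python
# Sketch of the suggested mapping: equal-sized contiguous CPU ranges
# share each transmit queue. Contiguity is a simplifying assumption,
# not a substitute for real cache-topology information.

def suggested_xps_masks(num_cpus, num_queues):
    """Return one CPU bitmask (as an int) per transmit queue."""
    masks = [0] * num_queues
    for cpu in range(num_cpus):
        # Map contiguous CPU ranges onto successive queues.
        masks[cpu * num_queues // num_cpus] |= 1 << cpu
    return masks
```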

For transmit queue selection based on receive queue(s), XPS must be
explicitly configured, mapping receive-queue(s) to transmit queue(s). If the
user configuration for the receive-queue map does not apply, then the
transmit queue is selected based on the CPUs map.


Per TX Queue rate limitation
============================

These are rate-limitation mechanisms implemented by HW, where currently
only a max-rate attribute is supported, by setting a Mbps value to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.
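
As a hypothetical illustration, a small helper could write such a cap
(requires root; the device name and queue number below are purely
illustrative):

```python
# Hypothetical helper: cap a transmit queue at a given rate in Mbps by
# writing to its tx_maxrate sysfs attribute. Needs root privileges.

def set_tx_maxrate(dev, queue, mbps):
    path = f"/sys/class/net/{dev}/queues/tx-{queue}/tx_maxrate"
    with open(path, "w") as f:
        f.write(str(mbps))  # 0 disables the limit (the default)

# Example (illustrative device name): set_tx_maxrate("eth0", 0, 100)
```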


Further Information
===================
RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com)

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org)

Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)