.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering


RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers, for example, a 4-tuple hash over the
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by taking the low-order seven bits of the computed hash for the
packet (usually a Toeplitz hash), using this number as an index into the
indirection table and reading the corresponding value.

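The lookup described above can be modelled in a few lines of Python. This
is only an illustrative sketch, not driver or firmware code: the function
names are hypothetical, the flow hash is taken as a given integer rather
than computed with Toeplitz, and the table size is assumed to be a power
of two so that a bit mask can select the low-order bits.

```python
# Illustrative model of an RSS indirection-table lookup (not driver code).
TABLE_SIZE = 128  # the common 128-entry table described in the text


def build_default_table(num_queues, size=TABLE_SIZE):
    """Spread queue numbers evenly over the table, round-robin."""
    return [i % num_queues for i in range(size)]


def select_rx_queue(flow_hash, table):
    """Index the table with the low-order bits of the hash.

    Assumes len(table) is a power of two (7 bits for 128 entries).
    """
    index = flow_hash & (len(table) - 1)
    return table[index]


table = build_default_table(num_queues=4)
queue = select_rx_queue(0x12345678, table)
```

Because the index depends only on the hash, every packet of a given flow
lands on the same receive queue.
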
Some advanced NICs allow steering packets to queues based on
programmable filters. For example, TCP packets bound for a web server
on port 80 can be directed to their own receive queue. Such “n-tuple”
filters can be configured from ethtool (--config-ntuple).


RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.


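To make the idea of relative weights concrete, the following Python sketch
fills an indirection table so that each queue owns a share of the slots
proportional to its weight. It is an assumption-laden illustration: the
function name is hypothetical and the slot assignment is one simple
proportional scheme, not necessarily the layout that ethtool or any driver
produces.

```python
def build_weighted_table(weights, size=128):
    """Return a table of `size` slots; queue i gets a share ~ weights[i]."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    table = []
    for slot in range(size):
        # Map the slot onto the cumulative weight range [0, total).
        target = slot * total // size
        cumulative = 0
        for queue, weight in enumerate(weights):
            cumulative += weight
            if target < cumulative:
                table.append(queue)
                break
    return table


# Queue 0 should receive roughly three times the flows of queue 1.
weighted = build_weighted_table([3, 1])
```

A queue with weight zero receives no slots, which effectively removes it
from hash-based distribution.
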
RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see Documentation/IRQ-affinity.txt. Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-CPU load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.


RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler, when
a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the CPU whose backlog queue should process the packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.


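The steering step can be sketched as a small Python model. It follows the
modulo description given above; the arithmetic the kernel actually uses to
scale the hash may differ. The function names and the (packet id, flow
hash) tuples are hypothetical and exist only for illustration.

```python
def rps_select_cpu(flow_hash, rps_cpu_list):
    """Pick the target CPU for a packet from the queue's RPS CPU list."""
    if not rps_cpu_list:
        return None  # RPS disabled: stay on the interrupting CPU
    return rps_cpu_list[flow_hash % len(rps_cpu_list)]


def steer(packets, rps_cpu_list):
    """Enqueue (packet_id, flow_hash) pairs onto per-CPU backlogs.

    Returns the backlogs and the CPUs that would be sent an IPI at the
    end of the bottom half because work was queued for them.
    """
    backlogs = {cpu: [] for cpu in rps_cpu_list}
    for packet_id, flow_hash in packets:
        cpu = rps_select_cpu(flow_hash, rps_cpu_list)
        backlogs[cpu].append(packet_id)  # append to the tail of the backlog
    ipi_targets = sorted(cpu for cpu, queue in backlogs.items() if queue)
    return backlogs, ipi_targets


backlogs, ipi_targets = steer([("a", 10), ("b", 11), ("c", 10)], [2, 3, 4])
```

Packets "a" and "c" share a flow hash, so they end up on the same backlog
in arrival order, which is what keeps a flow from being reordered.
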
RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to
the bitmap.


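As a small aid for building such a bitmap, the Python helper below turns a
list of CPU numbers into a hexadecimal mask string. The helper name is
hypothetical, and the comma-separated 32-bit grouping for masks wider than
32 CPUs is an assumption based on the format commonly seen in CPU mask
files; the resulting string would then be written to the rps_cpus file by
an administrator.

```python
def cpus_to_bitmap(cpus):
    """Return a hex CPU bitmap string, e.g. CPUs 0-3 -> "f".

    Masks wider than 32 bits are split into comma-separated groups of
    eight hex digits, most significant group first (assumed format).
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu  # bit N represents CPU N
    hex_str = format(mask, "x")
    groups = []
    while hex_str:
        groups.append(hex_str[-8:])
        hex_str = hex_str[:-8]
    return ",".join(reversed(groups))


# Value intended for /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
rps_mask = cpus_to_bitmap([0, 1, 2, 3])
```

An empty CPU list yields "0", which corresponds to RPS being disabled for
that queue.
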
Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since that CPU already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.


RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case, a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or a
spoofed-source denial-of-service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.


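The accounting described above can be approximated with the following
Python model. It is a simplified sketch rather than the kernel
implementation: the class and method names are hypothetical, the bucket
count and backlog limit are passed in as plain parameters, and the flow
hash is reduced to a bucket with a modulo.

```python
HISTORY_LEN = 256  # number of recent packets that are tracked


class FlowLimit:
    """Per-CPU model of flow limit accounting (illustrative only)."""

    def __init__(self, num_buckets=4096, max_backlog=1000):
        self.num_buckets = num_buckets
        self.max_backlog = max_backlog
        self.history = [0] * HISTORY_LEN  # ring of recent bucket ids
        self.head = 0
        self.buckets = [0] * num_buckets  # packets per bucket in history

    def should_drop(self, flow_hash, queue_len):
        # Inactive while the backlog is below half of its maximum.
        if queue_len < self.max_backlog // 2:
            return False
        new_flow = flow_hash % self.num_buckets
        # Replace the oldest history entry with the new packet's bucket.
        old_flow = self.history[self.head]
        self.history[self.head] = new_flow
        self.head = (self.head + 1) % HISTORY_LEN
        if self.buckets[old_flow] > 0:
            self.buckets[old_flow] -= 1
        self.buckets[new_flow] += 1
        # Drop once one bucket holds more than half of the history.
        return self.buckets[new_flow] > HISTORY_LEN // 2


limiter = FlowLimit()
decisions = [limiter.should_drop(7, queue_len=600) for _ in range(200)]
```

In this model a single dominant flow starts losing packets once it
accounts for more than half of the tracked history, while a packet from a
small flow arriving at the same moment is still accepted.
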
Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same one that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl::

  net.core.flow_limit_table_len

The value is only consulted when a new table is allocated. Modifying
it does not update active tables.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature depends on the input packet queue length exceeding
the flow limit threshold (50%) + the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.


RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase the data cache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
and tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (in get_rps_cpu()), the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.


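The three conditions above can be expressed as a short decision function.
The Python below is a sketch under simplifying assumptions: the data
structure, parameter names and the plain integer comparison are
hypothetical stand-ins for the kernel's tables and counters, and counter
wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class DevFlow:
    """One rps_dev_flow entry in this model."""
    cpu: int         # current CPU; a value >= nr_cpu_ids means "unset"
    last_qtail: int  # tail counter recorded at the last enqueue


def rfs_select_cpu(desired_cpu, flow, queue_head, online_cpus, nr_cpu_ids):
    """Return the CPU to enqueue on, migrating the flow only when safe.

    queue_head maps each CPU to its backlog head counter.
    """
    current = flow.cpu
    if desired_cpu != current:
        may_move = (
            current >= nr_cpu_ids                       # current CPU unset
            or current not in online_cpus               # current CPU offline
            or queue_head[current] >= flow.last_qtail   # backlog drained
        )
        if may_move:
            flow.cpu = desired_cpu
    return flow.cpu


# Packets of this flow are still pending on CPU 1, so the flow stays put.
flow = DevFlow(cpu=1, last_qtail=50)
target = rfs_select_cpu(0, flow, {0: 0, 1: 40}, {0, 1}, nr_cpu_ids=2)
```

Once CPU 1 has dequeued past the recorded tail value, the same call would
move the flow to CPU 0 without any risk of reordering.
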
RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. The number of entries in the global flow table is set through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.


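The sizing rule of thumb above is simple arithmetic and can be checked
with a few lines of Python. The helper names are hypothetical, and the
rounding helper merely mirrors the "rounded up to the nearest power of
two" behaviour stated in the text.

```python
def round_up_pow2(n):
    """Round n up to the nearest power of two (n must be >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def per_queue_flow_cnt(sock_flow_entries, num_queues):
    """Suggested rps_flow_cnt: rps_sock_flow_entries / N, as a power of two."""
    entries = round_up_pow2(sock_flow_entries)
    return round_up_pow2(max(1, entries // num_queues))


# The example from the text: 32768 global entries and 16 receive queues.
flow_cnt = per_queue_flow_cnt(32768, 16)  # 2048
```

For a single queue device the same helper returns the global value
unchanged, matching the recommendation for that case.
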
Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.


Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.


395*e6e37f63SOtto SabartXPS: Transmit Packet Steering
396*e6e37f63SOtto Sabart=============================
397*e6e37f63SOtto Sabart
398*e6e37f63SOtto SabartTransmit Packet Steering is a mechanism for intelligently selecting
399*e6e37f63SOtto Sabartwhich transmit queue to use when transmitting a packet on a multi-queue
400*e6e37f63SOtto Sabartdevice. This can be accomplished by recording two kinds of maps, either
401*e6e37f63SOtto Sabarta mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
402*e6e37f63SOtto Sabartto hardware transmit queue(s).
403*e6e37f63SOtto Sabart
404*e6e37f63SOtto Sabart1. XPS using CPUs map
405*e6e37f63SOtto Sabart
406*e6e37f63SOtto SabartThe goal of this mapping is usually to assign queues
407*e6e37f63SOtto Sabartexclusively to a subset of CPUs, where the transmit completions for
408*e6e37f63SOtto Sabartthese queues are processed on a CPU within this set. This choice
409*e6e37f63SOtto Sabartprovides two benefits. First, contention on the device queue lock is
410*e6e37f63SOtto Sabartsignificantly reduced since fewer CPUs contend for the same queue
411*e6e37f63SOtto Sabart(contention can be eliminated completely if each CPU has its own
412*e6e37f63SOtto Sabarttransmit queue). Secondly, cache miss rate on transmit completion is
413*e6e37f63SOtto Sabartreduced, in particular for data cache lines that hold the sk_buff
414*e6e37f63SOtto Sabartstructures.
415*e6e37f63SOtto Sabart
416*e6e37f63SOtto Sabart2. XPS using receive queues map
417*e6e37f63SOtto Sabart
418*e6e37f63SOtto SabartThis mapping is used to pick transmit queue based on the receive
419*e6e37f63SOtto Sabartqueue(s) map configuration set by the administrator. A set of receive
420*e6e37f63SOtto Sabartqueues can be mapped to a set of transmit queues (many:many), although
421*e6e37f63SOtto Sabartthe common use case is a 1:1 mapping. This will enable sending packets
422*e6e37f63SOtto Sabarton the same queue associations for transmit and receive. This is useful for
423*e6e37f63SOtto Sabartbusy polling multi-threaded workloads where there are challenges in
424*e6e37f63SOtto Sabartassociating a given CPU to a given application thread. The application
425*e6e37f63SOtto Sabartthreads are not pinned to CPUs and each thread handles packets
426*e6e37f63SOtto Sabartreceived on a single queue. The receive queue number is cached in the
427*e6e37f63SOtto Sabartsocket for the connection. In this model, sending the packets on the same
428*e6e37f63SOtto Sabarttransmit queue corresponding to the associated receive queue has benefits
429*e6e37f63SOtto Sabartin keeping the CPU overhead low. Transmit completion work is locked into
430*e6e37f63SOtto Sabartthe same queue-association that a given application is polling on. This
431*e6e37f63SOtto Sabartavoids the overhead of triggering an interrupt on another CPU. When the
432*e6e37f63SOtto Sabartapplication cleans up the packets during the busy poll, transmit completion
433*e6e37f63SOtto Sabartmay be processed along with it in the same thread context and so result in
434*e6e37f63SOtto Sabartreduced latency.
435*e6e37f63SOtto Sabart
436*e6e37f63SOtto SabartXPS is configured per transmit queue by setting a bitmap of
437*e6e37f63SOtto SabartCPUs/receive-queues that may use that queue to transmit. The reverse
438*e6e37f63SOtto Sabartmapping, from CPUs to transmit queues or from receive-queues to transmit
439*e6e37f63SOtto Sabartqueues, is computed and maintained for each network device. When
440*e6e37f63SOtto Sabarttransmitting the first packet in a flow, the function get_xps_queue() is
441*e6e37f63SOtto Sabartcalled to select a queue. This function uses the ID of the receive queue
442*e6e37f63SOtto Sabartfor the socket connection for a match in the receive queue-to-transmit queue
443*e6e37f63SOtto Sabartlookup table. Alternatively, this function can also use the ID of the
444*e6e37f63SOtto Sabartrunning CPU as a key into the CPU-to-queue lookup table. If the
445*e6e37f63SOtto SabartID matches a single queue, that is used for transmission. If multiple
446*e6e37f63SOtto Sabartqueues match, one is selected by using the flow hash to compute an index
447*e6e37f63SOtto Sabartinto the set. When selecting the transmit queue based on receive queue(s)
448*e6e37f63SOtto Sabartmap, the transmit device is not validated against the receive device as it
449*e6e37f63SOtto Sabartrequires expensive lookup operation in the datapath.

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out-of-order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out-of-order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.
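The caching rule can be modelled as follows. This is a minimal sketch
under stated assumptions: ``FlowSocket``, ``pick_queue`` and the
``choose`` callback are hypothetical names that stand in for the socket
structure, the per-packet decision, and the call to get_xps_queue().

```python
class FlowSocket:
    """Hypothetical stand-in for the per-flow socket state."""

    def __init__(self):
        # No transmit queue has been chosen for this flow yet.
        self.tx_queue = None


def pick_queue(sock, ooo_okay, choose):
    """Return the transmit queue for the next packet on this flow.

    choose is a callable that performs a fresh queue selection. It is
    only invoked when no queue is cached, or when ooo_okay says that
    no packets are outstanding and a change cannot cause reordering.
    """
    if sock.tx_queue is None or ooo_okay:
        sock.tx_queue = choose()
    return sock.tx_queue
```

In this model the cost of a fresh selection is paid on the first packet
and again only for packets that carry ooo_okay, so the common case is a
single cached lookup.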

XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). The functionality remains disabled until explicitly
configured. To enable XPS, the bitmap of CPUs/receive-queues that may
use a transmit queue is configured using one of the following sysfs file
entries.

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
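As an illustration, a bitmap for one of these files can be built from a
list of CPU IDs. The sketch below assumes the file accepts a hexadecimal
bitmap in which bit i stands for CPU i, and that masks wider than 32 bits
are written as comma-separated 32-bit groups; both points should be
checked against the running kernel. The helper name ``cpu_mask`` is
hypothetical.

```python
def cpu_mask(cpus):
    """Return a hex bitmap string with bit i set for each CPU i in cpus."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU IDs must be non-negative")
        mask |= 1 << cpu
    # Split into 32-bit words, lowest word first. Every word except the
    # most significant one is zero-padded to eight hex digits.
    words = []
    while mask > 0xFFFFFFFF:
        words.append(format(mask & 0xFFFFFFFF, "08x"))
        mask >>= 32
    words.append(format(mask, "x"))
    # The most significant word is written first.
    return ",".join(reversed(words))
```

For example, ``cpu_mask([0, 1, 2, 3])`` yields ``f``. Writing that string
to ``xps_cpus`` for a given transmit queue would allow CPUs 0 to 3 to use
the queue. The same bit layout would apply to ``xps_rxqs`` with receive
queue IDs in place of CPU IDs.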


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmit queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).
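A simple planning helper shows the shape of such a configuration. It is
a sketch only: the name ``plan_xps`` is hypothetical, and grouping CPUs
with adjacent IDs is an assumption that stands in for real cache
topology, which has to be read from the system in practice.

```python
def plan_xps(num_cpus, num_queues):
    """Map every CPU onto exactly one transmit queue.

    Returns a dict {queue_index: [cpu_ids]}. With as many queues as
    CPUs the pairing is exclusive. With fewer queues, CPUs with
    adjacent IDs share a queue (an assumption; adjacent IDs do not
    necessarily share a cache).
    """
    if num_cpus < 1 or num_queues < 1:
        raise ValueError("need at least one CPU and one queue")
    plan = {queue: [] for queue in range(num_queues)}
    for cpu in range(num_cpus):
        # Spread CPU IDs evenly over the available queues.
        plan[cpu * num_queues // num_cpus].append(cpu)
    return plan
```

With 4 CPUs and 4 queues each queue is paired with a single CPU. With
8 CPUs and 4 queues, CPUs 0 and 1 share queue 0, CPUs 2 and 3 share
queue 1, and so on.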

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured with a mapping of receive queue(s) to transmit
queue(s). If the user configuration for the receive-queue map does not
apply, then the transmit queue is selected based on the CPUs map.


Per TX Queue rate limitation
============================

This is a rate-limitation mechanism implemented by hardware. Currently
a max-rate attribute is supported, which is set by writing a value in
Mbps to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.
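Setting the attribute from a script amounts to writing a decimal number
to that file. The helper below is a minimal sketch: ``set_tx_maxrate``
and its ``sysfs_root`` parameter are hypothetical, the latter existing
only so that the function can be exercised against a scratch directory.
Writing to the real file typically needs root privileges and a driver
that supports the attribute.

```python
import os


def set_tx_maxrate(dev, queue, mbps, sysfs_root="/sys/class/net"):
    """Write a max rate in Mbps for one transmit queue; 0 disables it.

    Returns the path that was written.
    """
    if mbps < 0:
        raise ValueError("rate must be zero or a positive Mbps value")
    path = os.path.join(sysfs_root, dev, "queues", f"tx-{queue}",
                        "tx_maxrate")
    with open(path, "w") as f:
        f.write(str(mbps))
    return path
```

Calling ``set_tx_maxrate("eth0", 0, 100)`` would request a limit of
100 Mbps on the first transmit queue of a device named eth0 (the device
name is only an example), and passing 0 restores the default of no
limit.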


Further Information
===================

RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com).

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org).

Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)