
.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering


RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as "Receive-side Scaling" (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers -- for example, a 4-tuple hash over
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by masking out the low order seven bits of the computed hash for the
packet (usually a Toeplitz hash), taking this number as a key into the
indirection table and reading the corresponding value.

Some advanced NICs allow steering packets to queues based on
programmable filters. For example, webserver bound TCP port 80 packets
can be directed to their own receive queue. Such "n-tuple" filters can
be configured from ethtool (--config-ntuple).
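
The indirection step can be illustrated with a short user-space sketch.
The table contents and the hash value below are made up for the example;
real hardware computes the (typically Toeplitz) hash and performs this
lookup internally::

  #include <stdint.h>
  #include <stdio.h>

  #define RSS_INDIR_SIZE 128      /* 128-entry indirection table */

  /* Hypothetical indirection table, spread evenly over 4 queues the way
   * a driver might program it at initialization. */
  static uint8_t indir[RSS_INDIR_SIZE];

  static unsigned int rss_queue_for_hash(uint32_t hash)
  {
          /* Use the low order seven bits of the hash as a key into the
           * indirection table and read out the queue number. */
          return indir[hash & (RSS_INDIR_SIZE - 1)];
  }

  int main(void)
  {
          uint32_t example_hash = 0x9e3779b9; /* stand-in for a real hash */
          int i;

          for (i = 0; i < RSS_INDIR_SIZE; i++)
                  indir[i] = i % 4;           /* even spread over 4 queues */

          printf("hash 0x%08x -> rx queue %u\n",
                 example_hash, rss_queue_for_hash(example_hash));
          return 0;
  }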


RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The default
mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.

RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt see Documentation/core-api/irq/irq-affinity.rst. Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-cpu load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.


RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU's backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS: 1) it can be used with any NIC,
2) software filters can easily be added to hash over new protocols, and
3) it does not increase hardware device interrupt rate (although it does
introduce inter-processor interrupts (IPIs)).

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet's addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet's flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU's backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI kicks backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.
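
The per-queue CPU selection therefore reduces to a hash-modulo lookup.
A minimal sketch of that step (the CPU list below is an invented example,
and this is not the kernel's actual get_rps_cpu() code)::

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical rps_cpus list configured for one receive queue. */
  static const int rps_cpu_list[] = { 2, 3, 6, 7 };
  static const unsigned int rps_cpu_cnt = 4;

  /* Pick the backlog CPU for a packet from its flow hash: the index into
   * the configured CPU list is the hash modulo the size of the list. */
  static int rps_target_cpu(uint32_t flow_hash)
  {
          return rps_cpu_list[flow_hash % rps_cpu_cnt];
  }

  int main(void)
  {
          uint32_t flow_hash = 0x12345678;    /* would come from skb->hash */

          printf("flow hash 0x%08x -> backlog of CPU %d\n",
                 flow_hash, rps_target_cpu(flow_hash));
          return 0;
  }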


RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are assigned to
the bitmap.
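
For illustration, a small helper could format and write such a bitmap;
the device name, queue number and CPU mask below are placeholders, and
writing the file requires root privileges::

  #include <stdio.h>

  /* Write a hex CPU bitmap to a queue's rps_cpus file. */
  static int set_rps_cpus(const char *dev, int rxq, unsigned long mask)
  {
          char path[256];
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/class/net/%s/queues/rx-%d/rps_cpus", dev, rxq);
          f = fopen(path, "w");
          if (!f)
                  return -1;
          fprintf(f, "%lx\n", mask);  /* e.g. "f" enables CPUs 0-3 */
          return fclose(f);
  }

  int main(void)
  {
          /* Example only: allow RPS on CPUs 0-3 for eth0 rx queue 0. */
          if (set_rps_cpus("eth0", 0, 0xf))
                  perror("rps_cpus");
          return 0;
  }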

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain of the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since that already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.


RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or spoofed
source Denial of Service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.

Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl
net.core.flow_limit_table_len. It is only consulted when a new table
is allocated. Modifying it does not update active tables.
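
Conceptually, the per-bucket accounting behaves like the sketch below,
using the default table size, the 256-packet history window and the
one-half drop ratio described above. This is an illustration only, not
the kernel's actual flow limit bookkeeping::

  #include <stdint.h>

  #define FLOW_LIMIT_BUCKETS 4096     /* default table size */
  #define FLOW_LIMIT_HISTORY 256      /* sliding window of recent packets */

  static uint16_t bucket[FLOW_LIMIT_BUCKETS];
  static uint16_t history[FLOW_LIMIT_HISTORY]; /* recent bucket ids */
  static unsigned int history_head;

  /* Return 1 if this packet should be dropped: the input queue is past
   * half of netdev_max_backlog and this flow's bucket accounts for more
   * than half of the last FLOW_LIMIT_HISTORY packets. */
  static int flow_limit_drop(uint32_t flow_hash, unsigned int backlog_len,
                             unsigned int max_backlog)
  {
          unsigned int new_bucket = flow_hash & (FLOW_LIMIT_BUCKETS - 1);
          unsigned int old_bucket;

          if (backlog_len < max_backlog / 2)
                  return 0;           /* below threshold: never drop here */

          /* Slide the window: forget the oldest packet, record the newest. */
          old_bucket = history[history_head];
          history[history_head] = new_bucket;
          history_head = (history_head + 1) & (FLOW_LIMIT_HISTORY - 1);
          if (bucket[old_bucket])
                  bucket[old_bucket]--;

          return ++bucket[new_bucket] > FLOW_LIMIT_HISTORY / 2;
  }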

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).


RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase datacache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table
maps flows to the CPUs where those flows are being processed.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg.

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU's backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

- The current CPU's queue head counter >= the recorded tail counter
  value in rps_dev_flow[i]
- The current CPU is unset (>= nr_cpu_ids)
- The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.
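
The comparison can be summarized in a simplified sketch. The structure
and parameters below are stand-ins for the kernel's flow tables, with
the backlog head counter and CPU liveness passed in rather than read
from per-CPU state::

  #include <stdbool.h>

  #define NR_CPU_IDS 64               /* stand-in for nr_cpu_ids */

  struct dev_flow {                   /* one rps_dev_flow_table entry */
          unsigned int cpu;           /* current CPU for the flow */
          unsigned int last_qtail;    /* backlog tail when last enqueued */
  };

  /* Decide which backlog a packet of this flow should be enqueued on.
   * desired_cpu comes from the rps_sock_flow table (where the consuming
   * thread runs); flow->cpu is the current kernel processing CPU. */
  static unsigned int rfs_select_cpu(struct dev_flow *flow,
                                     unsigned int desired_cpu,
                                     unsigned int current_head,
                                     bool current_cpu_online)
  {
          if (flow->cpu == desired_cpu)
                  return desired_cpu; /* kernel and userspace agree */

          /* Switch only when no packets can still be outstanding on the
           * old CPU: its head counter has passed the recorded tail, or
           * the recorded CPU is unset or offline. */
          if (flow->cpu >= NR_CPU_IDS ||
              !current_cpu_online ||
              (int)(current_head - flow->last_qtail) >= 0) {
                  flow->cpu = desired_cpu;
                  return desired_cpu;
          }

          return flow->cpu;           /* stay on the old CPU for now */
  }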


RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. The number of entries in the global flow table is set through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and
there are 16 configured receive queues, rps_flow_cnt for each queue might
be configured as 2048.


Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread's CPU in the cache hierarchy.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap ("CPU affinity reverse map") kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.
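
The reverse map idea can be illustrated with a simplified sketch: given
each queue's IRQ affinity mask, pick for every CPU a queue whose
interrupt is handled on that CPU. The masks below are invented, and real
drivers rely on the cpu_rmap library, which also accounts for cache
distance when no queue is directly affine to a CPU::

  #include <stdint.h>

  #define MAX_QUEUES 8
  #define MAX_CPUS   16

  /* Hypothetical per-queue IRQ affinity masks (bit n = CPU n). */
  static const uint32_t irq_affinity[MAX_QUEUES] = {
          0x0003, 0x000c, 0x0030, 0x00c0, /* queues 0-3 on CPUs 0-7  */
          0x0300, 0x0c00, 0x3000, 0xc000, /* queues 4-7 on CPUs 8-15 */
  };

  static int cpu_to_queue[MAX_CPUS];

  /* Build the CPU to queue reverse map: for each CPU, use a queue whose
   * IRQ is affine to that CPU, falling back to queue 0 otherwise. */
  static void build_reverse_map(void)
  {
          int cpu, q;

          for (cpu = 0; cpu < MAX_CPUS; cpu++) {
                  cpu_to_queue[cpu] = 0;
                  for (q = 0; q < MAX_QUEUES; q++) {
                          if (irq_affinity[q] & (1u << cpu)) {
                                  cpu_to_queue[cpu] = q;
                                  break;
                          }
                  }
          }
  }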


Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.


XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This will enable sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the same
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low. Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during the busy poll, transmit completion
may be processed along with it in the same thread context and so result in
reduced latency.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device. When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection for a match in the receive queue-to-transmit queue
lookup table. Alternatively, this function can also use the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that is used for transmission. If multiple
queues match, one is selected by using the flow hash to compute an index
into the set. When selecting the transmit queue based on receive queue(s)
map, the transmit device is not validated against the receive device, as
that would require an expensive lookup operation in the datapath.
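
The selection order (receive-queue map first, then the CPU map, with the
flow hash breaking ties between multiple matching queues) can be sketched
as follows. The map layout is invented for the example and is not the
kernel's internal representation::

  #include <stdint.h>

  struct xps_map {
          unsigned int len;           /* number of candidate tx queues */
          const unsigned int *queues; /* candidate tx queue ids */
  };

  /* Pick a transmit queue: prefer the receive-queue based map, else the
   * CPU based map; if several queues match, use the flow hash to choose
   * one. Returns -1 when neither map applies. */
  static int xps_pick_queue(const struct xps_map *rxq_map,
                            const struct xps_map *cpu_map,
                            uint32_t flow_hash)
  {
          const struct xps_map *map = NULL;

          if (rxq_map && rxq_map->len)
                  map = rxq_map;      /* match on the recorded rx queue */
          else if (cpu_map && cpu_map->len)
                  map = cpu_map;      /* match on the running CPU */

          if (!map)
                  return -1;
          if (map->len == 1)
                  return map->queues[0];
          return map->queues[flow_hash % map->len];
  }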

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection), and
is used for subsequent packets sent on the flow. To avoid out of order
(ooo) packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out of order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.


XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init. The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs.

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
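
As with rps_cpus, these files take bitmaps. A minimal example of a 1:1
CPU-to-queue setup (the device name and queue count are placeholders,
and writing the files requires root privileges)::

  #include <stdio.h>

  /* Map tx queue n exclusively to CPU n by writing a one-bit CPU mask
   * into each queue's xps_cpus file. Example values only. */
  int main(void)
  {
          const char *dev = "eth0";   /* placeholder device */
          const int nqueues = 4;      /* placeholder queue count */
          char path[256];
          FILE *f;
          int q;

          for (q = 0; q < nqueues; q++) {
                  snprintf(path, sizeof(path),
                           "/sys/class/net/%s/queues/tx-%d/xps_cpus",
                           dev, q);
                  f = fopen(path, "w");
                  if (!f) {
                          perror(path);
                          continue;
                  }
                  fprintf(f, "%x\n", 1u << q); /* CPU q only */
                  fclose(f);
          }
          return 0;
  }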

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured mapping receive-queue(s) to transmit queue(s). If the
user configuration for receive-queue map does not apply, then the transmit
queue is selected based on the CPUs map.


Per TX Queue rate limitation
============================

This is a rate-limitation mechanism implemented in hardware. Currently
only a max-rate attribute is supported, set as a Mbps value via::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.


Further Information
===================

RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com).

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org).

Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)