.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering


RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers, for example, a 4-tuple hash over
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by extracting the low-order seven bits of the computed hash for the
packet (usually a Toeplitz hash), using this number as an index into the
indirection table and reading the corresponding value.

Some advanced NICs allow steering packets to queues based on
programmable filters. For example, webserver bound TCP port 80 packets
can be directed to their own receive queue. Such “n-tuple” filters can
be configured from ethtool (--config-ntuple).


RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues.
A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.


RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see Documentation/core-api/irq/irq-affinity.rst.
Some systems will be running irqbalance, a daemon that dynamically
optimizes IRQ assignments and as a result may override any manual settings.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-cpu load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.


RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler, when
a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the queue that should process a packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.


RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

 /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are
assigned to the bitmap.
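
As a rough illustration of the bitmap and of the CPU selection described
above, the Python sketch below builds the hexadecimal value that would be
written to rps_cpus and then picks a target CPU using the flow hash modulo
the size of the CPU list. The helper names (cpus_to_bitmap, bitmap_to_cpus,
select_rps_cpu) are hypothetical and exist only for this example; it models
the description in this document, not the kernel code itself.

```python
def cpus_to_bitmap(cpus):
    """Return the hex bitmap string for an iterable of CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def bitmap_to_cpus(bitmap):
    """Return the sorted list of CPU numbers set in a hex bitmap string.

    Commas are ignored so that values printed in comma-separated 32-bit
    groups can be parsed as well.
    """
    mask = int(bitmap.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


def select_rps_cpu(flow_hash, cpu_list):
    """Pick the target CPU: flow hash modulo the size of the CPU list.

    Returns None when the list is empty (RPS disabled), in which case the
    packet is processed on the interrupting CPU.
    """
    if not cpu_list:
        return None
    return cpu_list[flow_hash % len(cpu_list)]


if __name__ == "__main__":
    bitmap = cpus_to_bitmap([0, 1, 2, 3])   # allow CPUs 0-3 for this queue
    cpus = bitmap_to_cpus(bitmap)
    # 0x9E3779B9 is an arbitrary example hash value, not taken from a packet.
    print(bitmap, cpus, select_rps_cpu(0x9E3779B9, cpus))
```

Because every packet of a flow carries the same hash, all packets of that
flow map to the same entry of the list as long as the bitmap is unchanged.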


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since that already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.


RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or spoofed
source Denial of Service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.


Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap.
It exposes the same CPU
bitmap interface as rps_cpus (see above) when accessed through procfs::

 /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same one that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl::

 net.core.flow_limit_table_len

The value is only consulted when a new table is allocated. Modifying
it does not update active tables.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature depends on the input packet queue length exceeding
the flow limit threshold (50%) + the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.
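
The flow limit bookkeeping described in this section can be modelled
roughly as follows. This is a minimal Python sketch under stated
assumptions: the class name FlowLimit, its method and the ring-buffer
layout are hypothetical, while the history length (256), the default bucket
count (4096) and the one-half thresholds come from the text above. It is
meant to show the sliding-window idea, not to reproduce the kernel
implementation.

```python
HISTORY_LEN = 256      # number of recent packets tracked, per the text
NUM_BUCKETS = 4096     # default table size, per the text


class FlowLimit:
    """Per-CPU state: a ring of recent bucket indices plus bucket counters."""

    def __init__(self, max_backlog, num_buckets=NUM_BUCKETS):
        self.max_backlog = max_backlog
        self.num_buckets = num_buckets
        self.history = [0] * HISTORY_LEN
        self.head = 0
        self.buckets = [0] * num_buckets

    def should_drop(self, flow_hash, queue_len):
        """Return True if the arriving packet should be dropped early."""
        # Flow limit is inactive until the queue exceeds half the maximum.
        if queue_len <= self.max_backlog // 2:
            return False
        new = flow_hash % self.num_buckets
        old = self.history[self.head]
        # Slide the window: the oldest entry leaves, the new packet enters.
        self.history[self.head] = new
        self.head = (self.head + 1) % HISTORY_LEN
        if self.buckets[old] > 0:
            self.buckets[old] -= 1
        self.buckets[new] += 1
        # Drop once this bucket holds more than half of the window.
        return self.buckets[new] > HISTORY_LEN // 2


if __name__ == "__main__":
    limit = FlowLimit(max_backlog=1000)
    drops = sum(limit.should_drop(7, 600) for _ in range(200))
    print("packets dropped from the dominant flow:", drops)
```

A flow that shares a bucket with a dominant flow is penalized along with it,
which is why a larger table reduces false positives.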


RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase datacache hitrate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).
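
The head and tail counters can be pictured with a small Python model. The
names Backlog and flow_has_outstanding are hypothetical and chosen only for
this sketch; the arithmetic (tail counter = head counter + queue length,
head counter incremented on dequeue) follows the paragraph above. Comparing
the head counter with a recorded tail counter tells whether the last packet
enqueued for a flow has already left the backlog.

```python
from collections import deque


class Backlog:
    """Per-CPU backlog queue with a head counter."""

    def __init__(self):
        self.queue = deque()
        self.head = 0                  # incremented on every dequeue

    def enqueue(self, packet):
        """Queue a packet and return the tail counter to record for its flow."""
        self.queue.append(packet)
        return self.head + len(self.queue)   # tail = head + queue length

    def dequeue(self):
        """Remove the oldest packet and advance the head counter."""
        packet = self.queue.popleft()
        self.head += 1
        return packet


def flow_has_outstanding(backlog, last_tail):
    """True while the flow's last enqueued packet is still queued."""
    return backlog.head < last_tail


if __name__ == "__main__":
    backlog = Backlog()
    backlog.enqueue("other flow")
    last_tail = backlog.enqueue("flow i")    # value stored in the flow entry
    print(flow_has_outstanding(backlog, last_tail))
    backlog.dequeue()
    backlog.dequeue()
    print(flow_has_outstanding(backlog, last_tail))
```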

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.


RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured.
The number of entries in the global flow table is set through::

 /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

 /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.
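
The sizing rule above amounts to a division followed by rounding up to a
power of two. The Python sketch below works through it; the function names
round_up_pow2 and rfs_sizes are hypothetical helpers for this example, and
the result is only a starting point to be tuned against the real workload.

```python
def round_up_pow2(value):
    """Round a positive integer up to the nearest power of two."""
    return 1 << (value - 1).bit_length()


def rfs_sizes(sock_flow_entries, num_rx_queues):
    """Return (rps_sock_flow_entries, per-queue rps_flow_cnt) suggestions."""
    entries = round_up_pow2(sock_flow_entries)
    # Split the global table evenly over the receive queues, then round up.
    per_queue = round_up_pow2(max(1, entries // num_rx_queues))
    return entries, per_queue


if __name__ == "__main__":
    # The example from the text: 32768 global entries, 16 receive queues.
    print(rfs_sizes(32768, 16))
```

With a queue count that does not divide the table evenly, the per-queue
value is rounded up, so the per-queue tables together may hold more entries
than the global table.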


Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map.
For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.


Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.


XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This will enable sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the same
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low.
Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during the busy poll, transmit completion
may be processed along with it in the same thread context and so result in
reduced latency.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device. When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection for a match in the receive queue-to-transmit queue
lookup table. Alternatively, this function can also use the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that is used for transmission. If multiple
queues match, one is selected by using the flow hash to compute an index
into the set. When selecting the transmit queue based on receive queue(s)
map, the transmit device is not validated against the receive device, as
that would require an expensive lookup operation in the datapath.

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out of order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.


XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init.
The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs:

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured by mapping receive-queue(s) to transmit queue(s). If
the user configuration for the receive-queue map does not apply, then the
transmit queue is selected based on the CPUs map.
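
The selection order described in this section (receive-queue map first, CPUs
map as the fallback, flow hash to choose among several matching queues) can be
illustrated with a small user-space model. This is only a sketch and not the
kernel implementation: the names select_tx_queue() and pick_from_map() and the
dictionary-based maps are hypothetical, and a plain modulo is assumed for
turning the flow hash into an index.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the per-device reverse maps:
# key (receive queue ID or CPU ID) -> transmit queues allowed for that key.
QueueMap = Dict[int, List[int]]


def pick_from_map(queue_map: QueueMap, key: int, flow_hash: int) -> Optional[int]:
    """Return a transmit queue for `key`, or None if the map has no entry."""
    queues = queue_map.get(key, [])
    if not queues:
        return None
    if len(queues) == 1:
        # A single match is used directly.
        return queues[0]
    # Several matches: the flow hash picks an index into the set.
    # Modulo is an assumption made for this sketch.
    return queues[flow_hash % len(queues)]


def select_tx_queue(rxq_map: QueueMap, cpu_map: QueueMap,
                    rx_queue: Optional[int], cpu: int,
                    flow_hash: int) -> Optional[int]:
    """Try the receive-queue map first, then fall back to the CPUs map."""
    if rx_queue is not None:
        queue = pick_from_map(rxq_map, rx_queue, flow_hash)
        if queue is not None:
            return queue
    # The receive-queue map did not apply; use the running CPU as the key.
    return pick_from_map(cpu_map, cpu, flow_hash)


if __name__ == "__main__":
    rxq_map = {0: [2]}             # rx queue 0 may transmit on tx queue 2
    cpu_map = {1: [0, 1], 3: [3]}  # CPU 1 may use tx 0 or 1, CPU 3 uses tx 3
    print(select_tx_queue(rxq_map, cpu_map, rx_queue=0, cpu=1, flow_hash=7))
    print(select_tx_queue(rxq_map, cpu_map, rx_queue=None, cpu=3, flow_hash=7))
```

A return value of None in this model means that neither map produced a queue;
what the stack does in that case is outside the scope of the sketch.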


Per TX Queue rate limitation
============================

These are rate-limitation mechanisms implemented by hardware. Currently
a max-rate attribute is supported, which is set by writing a Mbps value to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.


Further Information
===================

RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com).

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org).

Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)