.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering


RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers, for example a 4-tuple hash over the
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by taking the low-order seven bits of the computed hash for the
packet (usually a Toeplitz hash), using this number as an index into the
indirection table, and reading the corresponding value.
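The lookup described above can be sketched in a few lines of Python. This
is a simplified illustration, not driver or firmware code: the function
name ``rss_queue_for_hash``, the example table contents, and the sample
hash value are assumptions made for the example. Only the 128-entry table
and the seven-bit index come from the text.

```python
TABLE_SIZE = 128  # 128-entry indirection table, as described above


def rss_queue_for_hash(flow_hash: int, indirection_table: list) -> int:
    """Return the receive queue for a packet, given its computed hash."""
    if len(indirection_table) != TABLE_SIZE:
        raise ValueError("expected a 128-entry indirection table")
    index = flow_hash & (TABLE_SIZE - 1)  # keep the low-order seven bits
    return indirection_table[index]


# Hypothetical default table: four queues spread evenly over the entries.
default_table = [i % 4 for i in range(TABLE_SIZE)]
queue = rss_queue_for_hash(0x9E3779B9, default_table)
```

Because the index depends only on the hash, all packets of one flow land
on the same queue for as long as the table is unchanged.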

Some advanced NICs allow steering packets to queues based on
programmable filters. For example, TCP port 80 packets bound for a web
server can be directed to their own receive queue. Such “n-tuple” filters
can be configured from ethtool (--config-ntuple).


RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.
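To make the idea of relative weights concrete, the following Python sketch
fills a table so that each queue receives a share of entries proportional
to its weight. It is an illustration only: the function
``weighted_indirection_table`` is hypothetical and does not claim to
reproduce how ethtool or any driver lays out the entries.

```python
def weighted_indirection_table(weights, table_size=128):
    """Fill the table so each queue's share of entries follows its weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one queue needs a positive weight")
    table = []
    for entry in range(table_size):
        # Position of this entry on a 0..total-1 scale.
        position = entry * total // table_size
        # Find the queue whose cumulative weight range covers the position.
        cumulative = 0
        for queue, weight in enumerate(weights):
            cumulative += weight
            if position < cumulative:
                table.append(queue)
                break
    return table


# Queue 0 gets three times as many entries as queue 1.
table = weighted_indirection_table([3, 1])
```

With weights of 3 and 1, queue 0 owns 96 of the 128 entries and queue 1
owns 32, so roughly three quarters of the flows hash to queue 0.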


RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see
Documentation/core-api/irq/irq-affinity.rst. Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-CPU load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.


RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler,
when a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the CPU whose backlog queue should process a packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.
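The steering step can be modelled in Python as follows. This is a sketch
of the behaviour as described in the paragraph above, not kernel code: the
function name ``steer_batch`` and the representation of packets as
``(flow_hash, payload)`` pairs are assumptions made for the example, and
the index is computed with a plain modulo as the text states.

```python
from collections import defaultdict, deque


def steer_batch(packets, rps_cpus):
    """Queue each packet on the backlog of the CPU chosen by its flow hash.

    packets  -- iterable of (flow_hash, payload) pairs
    rps_cpus -- list of CPUs configured for this receive queue
    Returns the per-CPU backlogs and the sorted list of CPUs to wake.
    """
    backlogs = defaultdict(deque)
    for flow_hash, payload in packets:
        cpu = rps_cpus[flow_hash % len(rps_cpus)]  # index = hash mod list size
        backlogs[cpu].append(payload)              # enqueue at the tail
    # One wake-up (IPI) per CPU that received work, sent after the batch.
    cpus_to_wake = sorted(backlogs)
    return backlogs, cpus_to_wake


backlogs, cpus_to_wake = steer_batch(
    [(10, "a1"), (11, "b1"), (10, "a2")], rps_cpus=[2, 3, 6]
)
```

In this example both packets with hash 10 are queued, in order, on CPU 3,
which is why steering by flow hash does not reorder packets within a flow.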


RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are
assigned to the bitmap.
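As an illustration of the bitmap format, the Python helper below turns a
list of CPU numbers into a hexadecimal mask string. The helper name
``cpus_to_bitmap`` is hypothetical, and the comma-separated 32-bit grouping
for masks wider than 32 bits is stated here as an assumption about the
kernel's bitmap syntax; check it against irq-affinity.rst for the target
kernel.

```python
def cpus_to_bitmap(cpus):
    """Return a hex CPU mask string, e.g. [0, 1, 2, 3] -> 'f'."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu  # bit N represents CPU N
    # Split into 32-bit groups, least significant group first.
    groups = []
    while True:
        groups.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break
    groups.reverse()  # most significant group is written first
    head = format(groups[0], "x")
    tail = [format(group, "08x") for group in groups[1:]]
    return ",".join([head] + tail)


mask = cpus_to_bitmap([0, 1, 2, 3])

# Writing the mask requires root; "eth0" and "rx-0" are placeholders:
# with open("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w") as f:
#     f.write(mask)
```

An empty CPU list yields ``0``, which corresponds to the default state in
which RPS is disabled for the queue.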


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since that already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.


RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or a
spoofed-source denial-of-service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.


Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl::

  net.core.flow_limit_table_len

The value is only consulted when a new table is allocated. Modifying
it does not update active tables.
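The flow limit behaviour described in this section can be approximated
with the Python model below. It is a sketch built from the prose, not a
transcription of the kernel implementation: the class name ``FlowLimit``,
its method names, and the use of a ring buffer to remember recent packets
are assumptions, while the numeric defaults (4096 buckets, a history of
256 packets, a one-half ratio, and a half-full queue threshold) are taken
from the text.

```python
class FlowLimit:
    """Per-CPU model: drop packets of a flow that dominates recent history."""

    def __init__(self, max_backlog=1000, table_len=4096, history_len=256):
        self.max_backlog = max_backlog      # net.core.netdev_max_backlog
        self.history_len = history_len      # packets remembered
        self.buckets = [0] * table_len      # per-bucket packet counters
        self.history = [None] * history_len # ring of recent bucket indexes
        self.head = 0

    def should_drop(self, flow_hash, queue_len):
        # Inactive until the input queue exceeds half the maximum length.
        if queue_len <= self.max_backlog // 2:
            return False
        new_bucket = flow_hash % len(self.buckets)
        # Forget the oldest remembered packet and record the new one.
        old_bucket = self.history[self.head]
        self.history[self.head] = new_bucket
        self.head = (self.head + 1) % self.history_len
        if old_bucket is not None and self.buckets[old_bucket] > 0:
            self.buckets[old_bucket] -= 1
        self.buckets[new_bucket] += 1
        # Drop when this flow holds more than half of the recent history.
        return self.buckets[new_bucket] > self.history_len // 2


limiter = FlowLimit(max_backlog=1000)
dominant = [limiter.should_drop(7, queue_len=600) for _ in range(129)]
small_flow_dropped = limiter.should_drop(9, queue_len=600)
```

In this run the first 128 packets of the dominant flow pass, the 129th is
dropped, and a packet from a different flow arriving right afterwards is
still accepted.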


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature depends on the input packet queue length exceeding
the flow limit threshold (50%) + the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.


RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase the data cache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.
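The three conditions above can be expressed as a short Python sketch. It
models only the decision, under stated simplifications: the names
``DevFlowEntry`` and ``rfs_select_cpu`` are hypothetical, the per-CPU head
counters are passed in as a dictionary covering every online CPU, and
counters are compared as plain integers without any wraparound handling.

```python
from dataclasses import dataclass


@dataclass
class DevFlowEntry:
    cpu: int         # current CPU; a value >= nr_cpu_ids means "unset"
    last_qtail: int  # tail counter recorded when the flow last enqueued


def rfs_select_cpu(desired_cpu, entry, queue_head, online_cpus, nr_cpu_ids):
    """Return the CPU to enqueue on, moving the flow only when it is safe."""
    current = entry.cpu
    if desired_cpu != current:
        may_switch = (
            current >= nr_cpu_ids                       # current CPU is unset
            or current not in online_cpus               # current CPU is offline
            or queue_head[current] >= entry.last_qtail  # old packets drained
        )
        if may_switch:
            entry.cpu = desired_cpu
    return entry.cpu


entry = DevFlowEntry(cpu=1, last_qtail=50)
heads = {0: 0, 1: 40, 2: 0}
# Packets are still outstanding on CPU 1, so the flow stays there.
first = rfs_select_cpu(2, entry, heads, online_cpus={0, 1, 2}, nr_cpu_ids=8)
# Once CPU 1 has dequeued past the recorded tail, the flow may move.
heads[1] = 50
second = rfs_select_cpu(2, entry, heads, online_cpus={0, 1, 2}, nr_cpu_ids=8)
```

The first call keeps the flow on CPU 1 and the second moves it to CPU 2,
which mirrors the ordering guarantee described above.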


RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. The number of entries in the global flow table is set through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.
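The arithmetic in this suggestion is small enough to check with a Python
helper. The function names ``round_up_pow2`` and ``per_queue_flow_cnt``
are hypothetical; the division by the queue count and the rounding up to
a power of two follow the text above.

```python
def round_up_pow2(n):
    """Round n up to the nearest power of two (n must be at least 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def per_queue_flow_cnt(rps_sock_flow_entries, num_rx_queues):
    """Suggested rps_flow_cnt: global entries divided by the queue count."""
    share = max(1, rps_sock_flow_entries // num_rx_queues)
    return round_up_pow2(share)


flow_cnt = per_queue_flow_cnt(32768, 16)
```

For 32768 global entries and 16 queues this gives 2048, matching the
example in the text. With 12 queues the quotient 2730 is rounded up to
4096.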


Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.


Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.


XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This will enable sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the same
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low. Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during the busy poll, transmit completion
may be processed along with it in the same thread context and so result in
reduced latency.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device. When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection for a match in the receive queue-to-transmit queue
lookup table. Alternatively, this function can also use the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that is used for transmission. If multiple
queues match, one is selected by using the flow hash to compute an index
into the set. When selecting the transmit queue based on receive queue(s)
map, the transmit device is not validated against the receive device, as
doing so would require an expensive lookup operation in the datapath.
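The selection logic in the paragraph above can be sketched in Python as
follows. The function ``xps_select_queue`` and the dictionary
representation of the two reverse maps are assumptions made for the
example, and a plain modulo stands in for whatever index computation the
kernel applies to the flow hash.

```python
from typing import Optional


def xps_select_queue(flow_hash, cpu, rx_queue, cpu_map, rxq_map) -> Optional[int]:
    """Pick a transmit queue from the receive-queue map, else the CPU map.

    cpu_map -- dict mapping a CPU to the list of transmit queues it may use
    rxq_map -- dict mapping a receive queue to a list of transmit queues
    Returns None when neither map has an entry, leaving the choice to a
    fallback outside the scope of this sketch.
    """
    candidates = []
    if rx_queue is not None:
        candidates = rxq_map.get(rx_queue, [])
    if not candidates:
        candidates = cpu_map.get(cpu, [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]  # a single match is used directly
    # Several matches: the flow hash picks one member of the set.
    return candidates[flow_hash % len(candidates)]


cpu_map = {0: [0], 1: [1, 2]}
rxq_map = {3: [5]}
by_rxq = xps_select_queue(123, cpu=1, rx_queue=3, cpu_map=cpu_map, rxq_map=rxq_map)
by_cpu = xps_select_queue(123, cpu=1, rx_queue=None, cpu_map=cpu_map, rxq_map=rxq_map)
```

Here the socket with a cached receive queue of 3 transmits on queue 5,
while a socket without one falls back to the CPU map and the flow hash
chooses between queues 1 and 2.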

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out of order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.

XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init. The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs:

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured by mapping receive-queue(s) to transmit queue(s).
If the user configuration for the receive-queue map does not apply, then
the transmit queue is selected based on the CPUs map.


Per TX Queue rate limitation
============================

These are rate-limitation mechanisms implemented by HW, where currently
a max-rate attribute is supported, by writing a value in Mbps to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.


Further Information
===================

RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com).

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org).

Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)