.. SPDX-License-Identifier: GPL-2.0

=============================
Network Function Representors
=============================

This document describes the semantics and usage of representor netdevices, as
used to control internal switching on SmartNICs. For the closely-related port
representors on physical (multi-port) switches, see
:ref:`Documentation/networking/switchdev.rst <switchdev>`.

Motivation
----------

Since the mid-2010s, network cards have started offering more complex
virtualisation capabilities than the legacy SR-IOV approach (with its simple
MAC/VLAN-based switching model) can support. This led to a desire to offload
software-defined networks (such as Open vSwitch) to these NICs to specify the
network connectivity of each function. The resulting designs are variously
called SmartNICs or DPUs.

Network function representors bring the standard Linux networking stack to
virtual switches and IOV devices. Just as each physical port of a
Linux-controlled switch has a separate netdev, so does each virtual port of a
virtual switch.
When the system boots, and before any offload is configured, all packets from
the virtual functions appear in the networking stack of the PF via the
representors. The PF can thus always communicate freely with the virtual
functions.
The PF can configure standard Linux forwarding between representors, the uplink
or any other netdev (routing, bridging, TC classifiers).

Thus, a representor is both a control plane object (representing the function in
administrative commands) and a data plane object (one end of a virtual pipe).
As a virtual link endpoint, the representor can be configured like any other
netdevice; in some cases (e.g. link state) the representee will follow the
representor's configuration, while in others there are separate APIs to
configure the representee.

Definitions
-----------

This document uses the term "switchdev function" to refer to the PCIe function
which has administrative control over the virtual switch on the device.
Typically, this will be a PF, but conceivably a NIC could be configured to grant
these administrative privileges instead to a VF or SF (subfunction).
Depending on NIC design, a multi-port NIC might have a single switchdev function
for the whole device or might have a separate virtual switch, and hence
switchdev function, for each physical network port.
If the NIC supports nested switching, there might be separate switchdev
functions for each nested switch, in which case each switchdev function should
only create representors for the ports on the (sub-)switch it directly
administers.

A "representee" is the object that a representor represents.
For example, in the case of a VF representor, the representee is the
corresponding VF.

What does a representor do?
---------------------------

A representor has three main roles.

1. It is used to configure the network connection the representee sees, e.g.
   link up/down, MTU, etc. For instance, bringing the representor
   administratively UP should cause the representee to see a link up / carrier
   on event.
2. It provides the slow path for traffic which does not hit any offloaded
   fast-path rules in the virtual switch. Packets transmitted on the
   representor netdevice should be delivered to the representee; packets
   transmitted by the representee which fail to match any switching rule should
   be received on the representor netdevice. (That is, there is a virtual pipe
   connecting the representor to the representee, similar in concept to a veth
   pair.)
   This allows software switch implementations (such as Open vSwitch or a Linux
   bridge) to forward packets between representees and the rest of the network.
3. It acts as a handle by which switching rules (such as TC filters) can refer
   to the representee, allowing these rules to be offloaded.

The combination of 2) and 3) means that the behaviour (apart from performance)
should be the same whether a TC filter is offloaded or not. E.g.
a TC rule
on a VF representor applies in software to packets received on that representor
netdevice, while in hardware offload it would apply to packets transmitted by
the representee VF. Conversely, a mirred egress redirect to a VF representor
corresponds in hardware to delivery directly to the representee VF.

What functions should have a representor?
-----------------------------------------

Essentially, for each virtual port on the device's internal switch, there
should be a representor.
Some vendors have chosen to omit representors for the uplink and the physical
network port, which can simplify usage (the uplink netdev becomes in effect the
physical port's representor) but does not generalise to devices with multiple
ports or uplinks.

Thus, the following should all have representors:

 - VFs belonging to the switchdev function.
 - Other PFs on the local PCIe controller, and any VFs belonging to them.
 - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
   System-on-Chip within the SmartNIC).
 - PFs and VFs with other personalities, including network block devices (such
   as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
   if) their network access is implemented through a virtual switch port. [#]_
   Note that such functions can require a representor despite the representee
   not having a netdev.
 - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
   their own port on the switch (as opposed to using their parent PF's port).
 - Any accelerators or plugins on the device whose interface to the network is
   through a virtual switch port, even if they do not have a corresponding PCIe
   PF or VF.

This allows the entire switching behaviour of the NIC to be controlled through
representor TC rules.

It is a common misunderstanding to conflate virtual ports with PCIe virtual
functions or their netdevs. While in simple cases there will be a 1:1
correspondence between VF netdevices and VF representors, more advanced device
configurations may not follow this.
A PCIe function which does not have network access through the internal switch
(not even indirectly through the hardware implementation of whatever services
the function provides) should *not* have a representor (even if it has a
netdev).
Such a function has no switch virtual port for the representor to configure or
to be the other end of the virtual pipe.
The representor represents the virtual port, not the PCIe function nor the 'end
user' netdevice.

.. [#] The concept here is that a hardware IP stack in the device performs the
   translation between block DMA requests and network packets, so that only
   network packets pass through the virtual port onto the switch. The network
   access that the IP stack "sees" would then be configurable through tc rules;
   e.g. its traffic might all be wrapped in a specific VLAN or VxLAN. However,
   any needed configuration of the block device *qua* block device, not being a
   networking entity, would not be appropriate for the representor and would
   thus use some other channel such as devlink.
   Contrast this with the case of a virtio-blk implementation which forwards the
   DMA requests unchanged to another PF whose driver then initiates and
   terminates IP traffic in software; in that case the DMA traffic would *not*
   run over the virtual switch and the virtio-blk PF should thus *not* have a
   representor.

How are representors created?
-----------------------------

The driver instance attached to the switchdev function should, for each virtual
port on the switch, create a pure-software netdevice which has some form of
in-kernel reference to the switchdev function's own netdevice or driver private
data (``netdev_priv()``).
This may be by enumerating ports at probe time, reacting dynamically to the
creation and destruction of ports at run time, or a combination of the two.

The operations of the representor netdevice will generally involve acting
through the switchdev function.
For example, ``ndo_start_xmit()`` might send
the packet through a hardware TX queue attached to the switchdev function, with
either packet metadata or queue configuration marking it for delivery to the
representee.

How are representors identified?
--------------------------------

The representor netdevice should *not* directly refer to a PCIe device (e.g.
through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
representee or of the switchdev function.
Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
assign a devlink port instance to the netdevice before registering the
netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
and ``phys_port_name`` sysfs nodes.
(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
``ndo_get_phys_port_name()`` directly, but this is deprecated.) See
:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
details of this API.

It is expected that userland will use this information (e.g. through udev rules)
to construct an appropriately informative name or alias for the netdevice. For
instance if the switchdev function is ``eth4`` then a representor with a
``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
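
The renaming in the example above can be sketched as a small shell helper, of
the kind a udev rule might call. This is only an illustration: the function
name ``rep_name`` and the rule it applies (drop a leading ``pN`` port prefix,
then append ``rep``) are assumptions inferred from the single example given
here, not an established convention.

```shell
# Hypothetical helper: build a representor name from the switchdev function's
# netdev name ($1) and the representor's phys_port_name ($2).
rep_name() {
    parent=$1
    port=$2
    # Drop the leading physical-port prefix ("p" followed by digits), if any.
    suffix=$(printf '%s\n' "$port" | sed 's/^p[0-9][0-9]*//')
    printf '%s%srep\n' "$parent" "$suffix"
}

rep_name eth4 p0pf1vf2    # prints eth4pf1vf2rep
```

In practice the second argument would be read from the representor's
``phys_port_name`` sysfs node, and the result must fit within the kernel's
15-character limit on netdevice names.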

There are as yet no established conventions for naming representors which do not
correspond to PCIe functions (e.g. accelerators and plugins).

How do representors interact with TC rules?
-------------------------------------------

Any TC rule on a representor applies (in software TC) to packets received by
that representor netdevice. Thus, if the delivery part of the rule corresponds
to another port on the virtual switch, the driver may choose to offload it to
hardware, applying it to packets transmitted by the representee.

Similarly, since a TC mirred egress action targeting the representor would (in
software) send the packet through the representor (and thus indirectly deliver
it to the representee), hardware offload should interpret this as delivery to
the representee.

As a simple example, if ``PORT_DEV`` is the physical port representor and
``REP_DEV`` is a VF representor, the following rules::

 tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
               action mirred egress redirect dev $PORT_DEV
 tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
               action mirred egress mirror dev $REP_DEV

would mean that all IPv4 packets from the VF are sent out the physical port, and
all IPv4 packets received on the physical port are delivered to the VF in
addition to ``PORT_DEV``.
(Note that without ``skip_sw`` on the second rule,
the VF would get two copies, as the packet reception on ``PORT_DEV`` would
trigger the TC rule again and mirror the packet to ``REP_DEV``.)

On devices without separate port and uplink representors, ``PORT_DEV`` would
instead be the switchdev function's own uplink netdevice.

Of course the rules can (if supported by the NIC) include packet-modifying
actions (e.g. VLAN push/pop), which should be performed by the virtual switch.

Tunnel encapsulation and decapsulation are rather more complicated, as they
involve a third netdevice (a tunnel netdev operating in metadata mode, such as
a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
require an IP address to be bound to the underlay device (e.g. switchdev
function uplink netdev or port representor).
TC rules such as::

 tc filter add dev $REP_DEV parent ffff: flower \
               action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
                                     dst_port 4789 \
               action mirred egress redirect dev vxlan0
 tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
               enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
               action tunnel_key unset action mirred egress redirect dev $REP_DEV

where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
another IP address on the same subnet, mean that packets sent by the VF should
be VxLAN encapsulated and sent out the physical port (the driver has to deduce
this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
perform an ARP/neighbour table lookup to find the MAC addresses to use in the
outer Ethernet frame), while UDP packets received on the physical port with UDP
port 4789 should be parsed as VxLAN and, if their VSID matches ``$VNI``,
decapsulated and forwarded to the VF.

If this all seems complicated, just remember the 'golden rule' of TC offload:
the hardware should ensure the same final results as if the packets were
processed through the slow path, traversed software TC (except ignoring any
``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
received through the representor netdevices.

Configuring the representee's MAC
---------------------------------

The representee's link state is controlled through the representor. Setting the
representor administratively UP or DOWN should cause carrier ON or OFF at the
representee.

Setting an MTU on the representor should cause that same MTU to be reported to
the representee.
(On hardware that allows configuring separate and distinct MTU and MRU values,
the representor MTU should correspond to the representee's MRU and vice-versa.)

Currently there is no way to use the representor to set the station permanent
MAC address of the representee; other methods available to do this include:

 - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
 - devlink port function (see **devlink-port(8)** and
   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
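
As a rough sketch of the two methods listed above, the helpers below wrap the
corresponding ``ip`` and ``devlink`` commands. The helper names, the ``RUN``
dry-run variable, the netdevice name ``eth4``, the VF number, the devlink port
handle ``pci/0000:06:00.0/2`` and the MAC address are all placeholders chosen
for illustration; the real port handle would be taken from the output of
``devlink port show``.

```shell
# Print the commands by default; run with RUN= (empty) to execute them as root.
RUN=${RUN-echo}

# Legacy SR-IOV: set the MAC of VF number $2 through the PF netdevice $1.
set_vf_mac_legacy() {
    $RUN ip link set "$1" vf "$2" mac "$3"
}

# devlink port function: set the hardware address of the function behind
# devlink port $1.
set_vf_mac_devlink() {
    $RUN devlink port function set "$1" hw_addr "$2"
}

# Placeholder values; substitute those of the actual system.
set_vf_mac_legacy eth4 0 00:11:22:33:44:55
set_vf_mac_devlink pci/0000:06:00.0/2 00:11:22:33:44:55
```

Only one of the two methods would normally be used for a given function; which
one is available depends on the driver.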