.. SPDX-License-Identifier: GPL-2.0

=============================
Network Function Representors
=============================

This document describes the semantics and usage of representor netdevices, as
used to control internal switching on SmartNICs.  For the closely-related port
representors on physical (multi-port) switches, see
:ref:`Documentation/networking/switchdev.rst <switchdev>`.

Motivation
----------

Since the mid-2010s, network cards have started offering more complex
virtualisation capabilities than the legacy SR-IOV approach (with its simple
MAC/VLAN-based switching model) can support.  This led to a desire to offload
software-defined networks (such as Open vSwitch) to these NICs to specify the
network connectivity of each function.  The resulting designs are variously
called SmartNICs or DPUs.

Network function representors bring the standard Linux networking stack to
virtual switches and IOV devices.  Just as each physical port of a
Linux-controlled switch has a separate netdev, so does each virtual port of a
virtual switch.
When the system boots, and before any offload is configured, all packets from
the virtual functions appear in the networking stack of the PF via the
representors.  The PF can thus always communicate freely with the virtual
functions.
The PF can configure standard Linux forwarding between representors, the uplink
or any other netdev (routing, bridging, TC classifiers).

Thus, a representor is both a control plane object (representing the function in
administrative commands) and a data plane object (one end of a virtual pipe).
As a virtual link endpoint, the representor can be configured like any other
netdevice; in some cases (e.g. link state) the representee will follow the
representor's configuration, while in others there are separate APIs to
configure the representee.

Definitions
-----------

This document uses the term "switchdev function" to refer to the PCIe function
which has administrative control over the virtual switch on the device.
Typically, this will be a PF, but conceivably a NIC could be configured to grant
these administrative privileges instead to a VF or SF (subfunction).
Depending on NIC design, a multi-port NIC might have a single switchdev function
for the whole device or might have a separate virtual switch, and hence
switchdev function, for each physical network port.
If the NIC supports nested switching, there might be separate switchdev
functions for each nested switch, in which case each switchdev function should
only create representors for the ports on the (sub-)switch it directly
administers.

A "representee" is the object that a representor represents.  So for example in
the case of a VF representor, the representee is the corresponding VF.

What does a representor do?
---------------------------

A representor has three main roles.

1. It is used to configure the network connection the representee sees, e.g.
   link up/down, MTU, etc.  For instance, bringing the representor
   administratively UP should cause the representee to see a link up / carrier
   on event.
2. It provides the slow path for traffic which does not hit any offloaded
   fast-path rules in the virtual switch.  Packets transmitted on the
   representor netdevice should be delivered to the representee; packets
   transmitted by the representee which fail to match any switching rule should
   be received on the representor netdevice.  (That is, there is a virtual pipe
   connecting the representor to the representee, similar in concept to a veth
   pair.)
   This allows software switch implementations (such as Open vSwitch or a Linux
   bridge) to forward packets between representees and the rest of the network.
3. It acts as a handle by which switching rules (such as TC filters) can refer
   to the representee, allowing these rules to be offloaded.

The combination of 2) and 3) means that the behaviour (apart from performance)
should be the same whether a TC filter is offloaded or not.  E.g. a TC rule
on a VF representor applies in software to packets received on that representor
netdevice, while in hardware offload it would apply to packets transmitted by
the representee VF.  Conversely, a mirred egress redirect to a VF representor
corresponds in hardware to delivery directly to the representee VF.

What functions should have a representor?
-----------------------------------------

Essentially, for each virtual port on the device's internal switch, there
should be a representor.
Some vendors have chosen to omit representors for the uplink and the physical
network port, which can simplify usage (the uplink netdev becomes in effect the
physical port's representor) but does not generalise to devices with multiple
ports or uplinks.

Thus, the following should all have representors:

 - VFs belonging to the switchdev function.
 - Other PFs on the local PCIe controller, and any VFs belonging to them.
 - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
   System-on-Chip within the SmartNIC).
 - PFs and VFs with other personalities, including network block devices (such
   as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
   if) their network access is implemented through a virtual switch port. [#]_
   Note that such functions can require a representor despite the representee
   not having a netdev.
 - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
   their own port on the switch (as opposed to using their parent PF's port).
 - Any accelerators or plugins on the device whose interface to the network is
   through a virtual switch port, even if they do not have a corresponding PCIe
   PF or VF.

This allows the entire switching behaviour of the NIC to be controlled through
representor TC rules.

It is a common misunderstanding to conflate virtual ports with PCIe virtual
functions or their netdevs.  While in simple cases there will be a 1:1
correspondence between VF netdevices and VF representors, more advanced device
configurations may not follow this.
A PCIe function which does not have network access through the internal switch
(not even indirectly through the hardware implementation of whatever services
the function provides) should *not* have a representor (even if it has a
netdev).
Such a function has no switch virtual port for the representor to configure or
to be the other end of the virtual pipe.
The representor represents the virtual port, not the PCIe function nor the 'end
user' netdevice.

.. [#] The concept here is that a hardware IP stack in the device performs the
   translation between block DMA requests and network packets, so that only
   network packets pass through the virtual port onto the switch.  The network
   access that the IP stack "sees" would then be configurable through tc rules;
   e.g. its traffic might all be wrapped in a specific VLAN or VxLAN.  However,
   any needed configuration of the block device *qua* block device, not being a
   networking entity, would not be appropriate for the representor and would
   thus use some other channel such as devlink.
   Contrast this with the case of a virtio-blk implementation which forwards the
   DMA requests unchanged to another PF whose driver then initiates and
   terminates IP traffic in software; in that case the DMA traffic would *not*
   run over the virtual switch and the virtio-blk PF should thus *not* have a
   representor.

How are representors created?
-----------------------------

The driver instance attached to the switchdev function should, for each virtual
port on the switch, create a pure-software netdevice which has some form of
in-kernel reference to the switchdev function's own netdevice or driver private
data (``netdev_priv()``).
This may be by enumerating ports at probe time, reacting dynamically to the
creation and destruction of ports at run time, or a combination of the two.

The operations of the representor netdevice will generally involve acting
through the switchdev function.  For example, ``ndo_start_xmit()`` might send
the packet through a hardware TX queue attached to the switchdev function, with
either packet metadata or queue configuration marking it for delivery to the
representee.

How are representors identified?
--------------------------------

The representor netdevice should *not* directly refer to a PCIe device (e.g.
through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
representee or of the switchdev function.
Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
assign a devlink port instance to the netdevice before registering the
netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
and ``phys_port_name`` sysfs nodes.
(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
``ndo_get_phys_port_name()`` directly, but this is deprecated.)  See
:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
details of this API.
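
From userspace, the two sysfs nodes can be read back to check what a given
netdevice reports.  The following is only a minimal sketch: the helper name
``rep_ident`` and the ``SYSFS_NET`` override (present purely so that the
function can be exercised against a fake directory tree) are assumptions made
for illustration, not part of any kernel or driver interface.

```shell
# Hypothetical helper: print "<phys_switch_id> <phys_port_name>" for a netdev.
# SYSFS_NET is an assumed override for testing; it defaults to the real sysfs.
rep_ident() {
    dir="${SYSFS_NET:-/sys/class/net}/$1"
    # Reading these nodes fails on netdevs that are not switch ports.
    id=$(cat "$dir/phys_switch_id" 2>/dev/null) || return 1
    name=$(cat "$dir/phys_port_name" 2>/dev/null) || return 1
    printf '%s %s\n' "$id" "$name"
}
```

Calling ``rep_ident "$REP_DEV"`` prints both identifiers on one line, and
returns non-zero for a netdevice that does not provide them.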

It is expected that userland will use this information (e.g. through udev rules)
to construct an appropriately informative name or alias for the netdevice.  For
instance if the switchdev function is ``eth4`` then a representor with a
``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
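
One way such a name could be derived is sketched below.  The ``rep_name``
helper and its policy (drop a leading ``p<N>`` physical-port component, append
``rep``) are assumptions chosen only to reproduce the example above; they are
not an established convention.

```shell
# Hypothetical naming helper following the example above:
# drop the leading physical-port part ("p0") and append "rep".
rep_name() {
    parent=$1
    port_name=$2
    suffix=$(printf '%s' "$port_name" | sed 's/^p[0-9][0-9]*//')
    printf '%s%srep\n' "$parent" "$suffix"
}

rep_name eth4 p0pf1vf2    # prints "eth4pf1vf2rep"
```

A helper of this kind could be run by hand or called from whatever mechanism
userland uses to rename netdevices.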

There are as yet no established conventions for naming representors which do not
correspond to PCIe functions (e.g. accelerators and plugins).

How do representors interact with TC rules?
-------------------------------------------

Any TC rule on a representor applies (in software TC) to packets received by
that representor netdevice.  Thus, if the delivery part of the rule corresponds
to another port on the virtual switch, the driver may choose to offload it to
hardware, applying it to packets transmitted by the representee.

Similarly, since a TC mirred egress action targeting the representor would (in
software) send the packet through the representor (and thus indirectly deliver
it to the representee), hardware offload should interpret this as delivery to
the representee.

As a simple example, if ``PORT_DEV`` is the physical port representor and
``REP_DEV`` is a VF representor, the following rules::

    tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
        action mirred egress redirect dev $PORT_DEV
    tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
        action mirred egress mirror dev $REP_DEV

would mean that all IPv4 packets from the VF are sent out the physical port, and
all IPv4 packets received on the physical port are delivered to the VF in
addition to ``PORT_DEV``.  (Note that without ``skip_sw`` on the second rule,
the VF would get two copies, as the packet reception on ``PORT_DEV`` would
trigger the TC rule again and mirror the packet to ``REP_DEV``.)

On devices without separate port and uplink representors, ``PORT_DEV`` would
instead be the switchdev function's own uplink netdevice.
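
The rules above attach to ``parent ffff:``, i.e. the ingress qdisc, which has
to exist on each device before the filters can be added.  The sketch below
shows one way to set that up and to check afterwards whether a filter was
accepted by hardware.  The helper names are hypothetical, and the check relies
on ``tc filter show`` printing an ``in_hw`` flag for offloaded filters, the
exact formatting of which may differ between iproute2 versions.

```shell
# Hypothetical helpers; device names are passed as arguments.

# "parent ffff:" refers to the ingress qdisc, which must exist first.
add_ingress() {
    for dev in "$@"; do
        tc qdisc add dev "$dev" ingress || return 1
    done
}

# Succeed if at least one ingress filter on the device is flagged in_hw.
# (-w keeps "not_in_hw" and "in_hw_count" from matching on their own.)
has_offloaded_filter() {
    tc filter show dev "$1" ingress | grep -qw in_hw
}
```

For instance, ``add_ingress "$REP_DEV" "$PORT_DEV"`` would be run before the
``tc filter add`` commands, and ``has_offloaded_filter "$REP_DEV"`` after them.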

Of course the rules can (if supported by the NIC) include packet-modifying
actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
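
As an illustration of such packet-modifying rules, the sketch below tags
traffic from the VF with a VLAN on its way to the physical port and strips the
tag in the opposite direction.  The ``vlan_rules`` helper and its argument
order are assumptions for this example; whether a given NIC can offload these
actions is device-specific.

```shell
# Hypothetical helper: tag VF traffic with a VLAN on its way to the port and
# strip the tag in the other direction. Arguments: REP_DEV PORT_DEV VLAN_ID.
vlan_rules() {
    rep=$1 port=$2 vid=$3
    tc filter add dev "$rep" parent ffff: protocol ipv4 flower \
        action vlan push id "$vid" \
        action mirred egress redirect dev "$port" &&
    tc filter add dev "$port" parent ffff: protocol 802.1q flower \
        vlan_id "$vid" \
        action vlan pop \
        action mirred egress redirect dev "$rep"
}
```

It might be invoked as ``vlan_rules "$REP_DEV" "$PORT_DEV" 100``.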

Tunnel encapsulation and decapsulation are rather more complicated, as they
involve a third netdevice (a tunnel netdev operating in metadata mode, such as
a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
require an IP address to be bound to the underlay device (e.g. switchdev
function uplink netdev or port representor).  TC rules such as::

    tc filter add dev $REP_DEV parent ffff: flower \
        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
                              dst_port 4789 \
        action mirred egress redirect dev vxlan0
    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
        action tunnel_key unset action mirred egress redirect dev $REP_DEV

where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
another IP address on the same subnet, mean that packets sent by the VF should
be VxLAN encapsulated and sent out the physical port (the driver has to deduce
this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
perform an ARP/neighbour table lookup to find the MAC addresses to use in the
outer Ethernet frame), while UDP packets received on the physical port with UDP
port 4789 should be parsed as VxLAN and, if their VSID matches ``$VNI``,
decapsulated and forwarded to the VF.
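
The tunnel rules above presuppose that ``vxlan0`` exists and that ``LOCAL_IP``
is bound to ``PORT_DEV``.  A possible preparation sequence is sketched below;
the helper name, its arguments and the explicit ``dstport 4789`` are
assumptions for this example rather than requirements stated in this document.

```shell
# Hypothetical setup helper. Arguments: PORT_DEV LOCAL_IP PREFIX_LEN.
vxlan_underlay_setup() {
    port=$1 local_ip=$2 plen=$3
    # Bind the local tunnel endpoint address to the underlay device.
    ip addr add "$local_ip/$plen" dev "$port" &&
    # Metadata-mode ("external") VxLAN device using UDP port 4789.
    ip link add vxlan0 type vxlan external dstport 4789 &&
    ip link set dev vxlan0 up &&
    # The tunnel netdev needs its own ingress qdisc for the decap rule.
    tc qdisc add dev vxlan0 ingress
}
```

For example, ``vxlan_underlay_setup "$PORT_DEV" "$LOCAL_IP" 24`` would be run
once before installing the two filters.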

If this all seems complicated, just remember the 'golden rule' of TC offload:
the hardware should ensure the same final results as if the packets were
processed through the slow path, traversed software TC (except ignoring any
``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
received through the representor netdevices.

Configuring the representee's MAC
---------------------------------

The representee's link state is controlled through the representor.  Setting the
representor administratively UP or DOWN should cause carrier ON or OFF at the
representee.

Setting an MTU on the representor should cause that same MTU to be reported to
the representee.
(On hardware that allows configuring separate and distinct MTU and MRU values,
the representor MTU should correspond to the representee's MRU and vice-versa.)
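
In terms of commands, the link state and MTU handling described above amounts
to ordinary ``ip link`` operations on the representor.  The helper below is a
hypothetical wrapper used only to show the sequence.

```shell
# Hypothetical helper. Arguments: REP_DEV MTU.
rep_bring_up() {
    # The same MTU should then be reported to the representee ...
    ip link set dev "$1" mtu "$2" &&
    # ... and the representee should see carrier ON once this succeeds.
    ip link set dev "$1" up
}
```

Conversely, ``ip link set dev "$REP_DEV" down`` should result in carrier OFF
at the representee.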

Currently there is no way to use the representor to set the station permanent
MAC address of the representee; other methods available to do this include:

 - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
 - devlink port function (see **devlink-port(8)** and
   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
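
The two methods listed above might be wrapped as follows.  The function names
are hypothetical, and the devlink port handle shown in the comment is a
made-up placeholder; the real handle has to be taken from ``devlink port
show`` on the system in question.

```shell
# Hypothetical helpers; device names and port handles are placeholders.

# Legacy SR-IOV. Arguments: PF netdev, VF index, MAC address.
set_vf_mac_legacy() {
    ip link set "$1" vf "$2" mac "$3"
}

# devlink port function. Arguments: devlink port handle (for example
# pci/0000:03:00.0/1, a made-up value), MAC address.
set_function_mac_devlink() {
    devlink port function set "$1" hw_addr "$2"
}
```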