.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It has a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with the port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have a different type based on its link layer, as
described below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when a link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when a link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when driver should detect the port
       type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
An eswitch is on the PCI device which supports ports of multiple controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) doesn't have the eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.

Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means that the user should configure the function
attribute before a bus-specific device for the function is created. However,
when SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, the function attribute should be configured before binding the virtual
function device to the driver. For subfunctions, this means the user should
configure the port function attribute before activating the port function.

A user may set the hardware address of the function using the
`devlink port function set hw_addr` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
`devlink port function set roce` command.

Users may also set the function as migratable using the
`devlink port function set migratable` command.

Users may also set the IPsec crypto capability of the function using the
`devlink port function set ipsec_crypto` command.

Users may also set the IPsec packet capability of the function using the
`devlink port function set ipsec_packet` command.

Function attributes
===================

MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by the netdevice and
rdma device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88

RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

Disabling RoCE capability saves system memory per PCI VF/SF.

When a user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF, and the RoCE GID table
for this VF/SF will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When a user enables the migratable capability for a VF, and the hypervisor
(HV) binds the VF to a VFIO driver with migration support, the user can
migrate the VM with this VF from one HV to a different one.

However, when the migratable capability is enabled, the device disables
features which cannot be migrated. The migratable capability can therefore
impose limitations on a VF, so the decision is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

- Attach the VF to the VM.
- Start the VM.
- Perform live migration.

IPsec crypto capability setup
-----------------------------
When a user enables IPsec crypto capability for a VF, user applications can
offload XFRM state crypto operations (Encrypt/Decrypt) to this VF.

When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
processed in software by the kernel.

- Get IPsec crypto capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_crypto disabled

- Set IPsec crypto capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_crypto enabled

IPsec packet capability setup
-----------------------------
When a user enables IPsec packet capability for a VF, user applications can
offload XFRM state and policy crypto operations (Encrypt/Decrypt) to this VF,
as well as IPsec encapsulation.

When IPsec packet capability is disabled (default) for a VF, the XFRM state and
policy are processed in software by the kernel.

- Get IPsec packet capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet disabled

- Set IPsec packet capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet enabled

Subfunction
============

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. A subfunction is created and deployed in units of one.
Unlike SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using a devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and
the representor netdevice.

(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user might use the
e-switch port representor to do settings, such as putting it into a bridge or
adding TC rules. A user might as well configure the hardware address (such as
the MAC address) of the subfunction while the subfunction is inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

Rate object management
======================

Devlink provides an API to manage TX rates of a single devlink port or a group.
This is done through rate objects, which can be one of two types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a one-to-one mapping to its devlink port, in user space it is
  referred to as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from the userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier, except a decimal number, to avoid
  collisions with leafs.

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or among the
  rate objects that are part of the same parent group, if this object belongs
  to a group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with the
  strict priority. As a node is configured with a higher rate, it gets more
  BW relative to its siblings. Values are relative, like percentage
  points; they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.

``tx_priority`` and ``tx_weight`` can be used simultaneously.
In that case nodes with the same priority form a WFQ subgroup in the sibling
group and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.

Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device having one or more PCI buses consists of one or
       more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.
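
As an illustration of the arbitration flow described under "Rate object
management", the following is a minimal user-space sketch in Python. It is a
toy model and not how any driver or the devlink core implements scheduling.
The ``RateNode`` class, the ``blocked`` flag, the ``served`` counter and the
rule that a larger ``tx_priority`` value means higher priority are all
assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RateNode:
    """Hypothetical model of a rate object (leaf when it has no children)."""
    name: str
    tx_priority: int = 0   # assumption: larger value means higher priority
    tx_weight: int = 1     # relative WFQ weight among equal-priority siblings
    blocked: bool = False  # hypothetical: node is blocked or over its BW limit
    children: List["RateNode"] = field(default_factory=list)
    served: int = 0        # bookkeeping for the toy WFQ below


def eligible(node: RateNode) -> bool:
    """A node can win only if it is not blocked and can reach an eligible leaf."""
    if node.blocked:
        return False
    return not node.children or any(eligible(c) for c in node.children)


def pick(siblings: List[RateNode]) -> Optional[RateNode]:
    """Strict priority first, then WFQ inside the top-priority subgroup."""
    candidates = [n for n in siblings if eligible(n)]
    if not candidates:
        return None
    top = max(n.tx_priority for n in candidates)
    subgroup = [n for n in candidates if n.tx_priority == top]
    # Toy WFQ: the sibling with the least service per unit of weight wins.
    return min(subgroup, key=lambda n: n.served / n.tx_weight)


def arbitrate(root: RateNode) -> Optional[RateNode]:
    """Walk from the root towards a leaf, selecting a winner at each level."""
    if not eligible(root):
        return None
    node = root
    while node.children:
        node = pick(node.children)  # never None below an eligible node
        node.served += 1
    return node
```

In this model a non-blocked sibling with a higher ``tx_priority`` always wins.
Once it is marked ``blocked``, the remaining equal-priority siblings share
selections in proportion to their ``tx_weight``: two leafs with weights 2 and
1 win in a two-to-one ratio over a multiple of three rounds.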