.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

A ``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. The flavour of a devlink port, together with its port
attributes, describes what the port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of a PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of a PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of a PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have one of the following types, depending on its link
layer.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

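
From userspace, a port type can be requested with ``devlink port set``. The
following is only a minimal sketch: the helper name ``set_port_type`` is
hypothetical, and whether a type change is accepted depends on the driver.

.. code-block:: shell

    # Hypothetical helper: request a port type for one devlink port.
    # $1: port handle (DEV/PORT_INDEX), $2: one of eth, ib or auto.
    set_port_type() {
        case "$2" in
        eth|ib|auto)
            devlink port set "$1" type "$2"
            ;;
        *)
            echo "unsupported port type: $2" >&2
            return 1
            ;;
        esac
    }

For example, ``set_port_type pci/0000:06:00.0/1 auto`` would ask the driver to
detect the type of the port with index 1. The handle is an assumed value used
for illustration only.
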

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes, or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
The eswitch resides on the PCI device and supports ports of multiple
controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) doesn't have the eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.

Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means that the user should configure the function
attribute before a bus-specific device for the function is created. However,
when SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, the function attribute should be configured before binding the virtual
function device to the driver. For subfunctions, this means the user should
configure the port function attribute before activating the port function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
``devlink port function set roce`` command.

Users may also set the function as migratable using the
``devlink port function set migratable`` command.

Function attributes
===================

MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by the netdevice and
RDMA device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88

RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

Disabling RoCE capability saves system memory per PCI VF/SF.

When the user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF and the RoCE GID table for
this PCI function will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When the user enables the migratable capability for a VF, and the hypervisor
(HV) binds the VF to a VFIO driver with migration support, the user can
migrate the VM with this VF from one HV to a different one.

However, when the migratable capability is enabled, the device will disable
features which cannot be migrated. The migratable capability can therefore
impose limitations on a VF, so the decision is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/drivers_probe

- Attach the VF to the VM.
- Start the VM.
- Perform live migration.

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on which
it is deployed. Subfunctions are created and deployed in units of 1. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a 3-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction.

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using the devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and the
representor netdevice.

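
From userspace, this step corresponds to ``devlink port add``. A minimal
sketch follows; the helper name ``sf_create`` is hypothetical and only wraps
the devlink command.

.. code-block:: shell

    # Hypothetical helper: add a devlink port of subfunction (pcisf) flavour.
    # $1: devlink device handle, $2: PF number, $3: user-chosen SF number.
    sf_create() {
        devlink port add "$1" flavour pcisf pfnum "$2" sfnum "$3"
    }

For example, ``sf_create pci/0000:06:00.0 0 88`` reuses the device handle and
SF number from the MAC address examples in this document. The index of the
new port, which the later steps need, can be read back with
``devlink port show``.
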

(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user might use
the e-switch port representor to apply settings, such as putting it into a
bridge or adding TC rules. A user might also configure the hardware address
(such as the MAC address) of the subfunction while the subfunction is
inactive.

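
As an illustration of this step, the sketch below sets the hardware address
and attaches the representor to a bridge. The helper name ``sf_configure`` is
hypothetical, and the bridge is assumed to exist already.

.. code-block:: shell

    # Hypothetical helper: configure a subfunction port that is still inactive.
    # $1: port handle (DEV/PORT_INDEX), $2: MAC address,
    # $3: representor netdevice, $4: existing bridge to attach it to.
    sf_configure() {
        devlink port function set "$1" hw_addr "$2" &&
        ip link set dev "$3" master "$4"
    }

A call such as
``sf_configure pci/0000:06:00.0/32768 00:00:00:00:88:88 enp6s0pf0sf88 br0``
uses the port handle, address and netdevice name from the MAC address
examples in this document; ``br0`` is an assumed bridge name.
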

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

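
Activation is requested through the ``state`` attribute of the port function.
The sketch below shows it together with a teardown helper; both helper names
are hypothetical, and the teardown sequence is an addition that the text
above does not describe.

.. code-block:: shell

    # Hypothetical helpers. $1 is the port handle (DEV/PORT_INDEX).

    # Activate a configured subfunction port.
    sf_activate() {
        devlink port function set "$1" state active
    }

    # Deactivate the subfunction, then remove its devlink port.
    sf_destroy() {
        devlink port function set "$1" state inactive &&
        devlink port del "$1"
    }

For example, ``sf_activate pci/0000:06:00.0/32768`` would activate the SF
port used in the MAC address examples in this document.
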

Rate object management
======================

Devlink provides an API to manage TX rates of a single devlink port or a group.
This is done through rate objects, which can be one of two types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1-to-1 mapping to its devlink port, in userspace it is referred
  to as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier, except a decimal number, to avoid
  collisions with leafs.

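
The naming rule for nodes can be checked before a node is requested. The
sketch below is one way to do so; the helper name ``rate_node_add`` is
hypothetical, and the check only mirrors the rule stated above (it is not a
copy of any validation done by devlink itself).

.. code-block:: shell

    # Hypothetical helper: create an empty rate node.
    # $1: devlink device handle, $2: node name.
    rate_node_add() {
        case "$2" in
        '')
            echo "node name must not be empty" >&2
            return 1
            ;;
        *[!0-9]*)
            # At least one non-digit character, so the name cannot be
            # mistaken for the port index of a leaf.
            devlink port function rate add "$1/$2"
            ;;
        *)
            echo "node name must not be a decimal number: $2" >&2
            return 1
            ;;
        esac
    }

With this helper, ``rate_node_add pci/0000:06:00.0 group1`` issues the devlink
command, while ``rate_node_add pci/0000:06:00.0 42`` is rejected locally. The
node name ``group1`` is an assumed example value.
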

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or among the
  rate objects that are part of the same parent group, if the object belongs
  to a group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme
  among siblings. This arbitration scheme can be used simultaneously with
  strict priority. As a node is configured with a higher rate it gets more
  bandwidth (BW) relative to its siblings. Values are relative, like
  percentage points: they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.

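
To illustrate ``tx_max`` and ``parent`` together, the sketch below creates a
node with an upper limit and attaches leafs to it. The helper name
``rate_group_cap`` and the values in the usage line are hypothetical; which
parameters are accepted depends on the driver.

.. code-block:: shell

    # Hypothetical helper: limit a group of ports through one parent node.
    # $1: devlink device handle, $2: node name, $3: tx_max for the node,
    # remaining arguments: port indexes of the leafs to attach.
    rate_group_cap() {
        dev=$1 node=$2 max=$3
        shift 3
        devlink port function rate add "$dev/$node" tx_max "$max" || return 1
        for port in "$@"; do
            devlink port function rate set "$dev/$port" parent "$node" ||
                return 1
        done
    }

For example, ``rate_group_cap pci/0000:06:00.0 group1 100mbit 1 2`` would place
the leafs of ports 1 and 2 under ``group1``, so that the node's ``tx_max``
acts as an additional upper limit for both.
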

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority subgroup are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.

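
The parameters that drive this flow can be set per rate object. The sketch
below is only an illustration: the helper name ``rate_set_arbitration`` is
hypothetical, and support for ``tx_priority`` and ``tx_weight`` depends on
the driver and on the devlink tool version.

.. code-block:: shell

    # Hypothetical helper: set the arbitration parameters of one rate object.
    # $1: rate object handle (leaf or node), $2: tx_priority, $3: tx_weight.
    rate_set_arbitration() {
        devlink port function rate set "$1" tx_priority "$2" tx_weight "$3"
    }

Calling it as ``rate_set_arbitration pci/0000:06:00.0/1 0 3`` and
``rate_set_arbitration pci/0000:06:00.0/2 0 1`` gives two sibling leafs the
same priority, so they form one WFQ subgroup, and the weights 3 and 1 then
express how much BW each should take relative to the other. The handles and
values are assumed examples.
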

Driver implementations are allowed to support either or both rate object types
and methods of setting their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

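
What a given driver exports can be inspected from userspace with the rate
``show`` command. A minimal sketch, with the hypothetical helper name
``rate_show``:

.. code-block:: shell

    # Hypothetical helper: list all rate objects, or only the one given in $1.
    rate_show() {
        if [ $# -eq 0 ]; then
            devlink port function rate show
        else
            devlink port function rate show "$1"
        fi
    }
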

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device that has one or more PCI buses and consists of
       one or more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.