.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with the port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have a different type depending on its link layer, as
described below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
The eswitch resides on the PCI device and supports ports of multiple
controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) does not have an eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.

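As a rough illustration of the two-controller example, the following Python
sketch enumerates the eswitch ports that the devlink instance on the local
controller would expose. The ``EswitchPort`` class, the ``eswitch_ports()``
helper and the ``ctrl-N pfXvfY`` label format are assumptions that simply
mirror the labels used in the diagram; they are not a kernel or devlink API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EswitchPort:
    """Hypothetical model of one eswitch port (representor) in the diagram."""
    controller: int
    pf: int
    vf: Optional[int] = None
    sf: Optional[int] = None

    def label(self) -> str:
        # Build labels such as "ctrl-1 pf0vf0", following the diagram.
        name = f"pf{self.pf}"
        if self.vf is not None:
            name += f"vf{self.vf}"
        elif self.sf is not None:
            name += f"sf{self.sf}"
        return f"ctrl-{self.controller} {name}"


def eswitch_ports(controllers, pfs, vfs_per_pf, sfs_per_pf) -> List[EswitchPort]:
    """One port per PF, VF and SF of every controller served by the eswitch."""
    ports = []
    for ctrl in controllers:
        for pf in pfs:
            ports.append(EswitchPort(ctrl, pf))
            ports.extend(EswitchPort(ctrl, pf, vf=v) for v in range(vfs_per_pf))
            ports.extend(EswitchPort(ctrl, pf, sf=s) for s in range(sfs_per_pf))
    return ports


# Local controller 0 and external controller 1, two PFs each,
# with one VF and one SF per PF.
ports = eswitch_ports(controllers=[0, 1], pfs=[0, 1], vfs_per_pf=1, sfs_per_pf=1)
labels = [p.label() for p in ports]
```

The controller number is what keeps ``pf0`` of the external controller apart
from ``pf0`` of the local controller on the same eswitch.
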
Function configuration
======================

A user can configure function attributes before the PCI function is
enumerated. Usually this means that the user should configure the function
attributes before a bus-specific device for the function is created. However,
when SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, function attributes should be configured before the virtual function
device is bound to its driver. For subfunctions, this means the user should
configure port function attributes before activating the port function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

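A minimal sketch of preparing that command from a script is shown here. It
checks that the address looks like an Ethernet MAC and builds the argument
list without running it. The helper name ``hw_addr_set_argv()``, the port
handle ``pci/0000:06:00.0/2`` and the MAC address are hypothetical values
chosen for illustration, and the ``<bus>/<bus_addr>/<port_index>`` handle form
is an assumption.

```python
import re
from typing import List

# Six colon-separated hex octets, e.g. 00:11:22:33:44:55.
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$")


def hw_addr_set_argv(port_handle: str, mac: str) -> List[str]:
    """Return the argv for setting the hardware address of a port function."""
    if not MAC_RE.match(mac):
        raise ValueError(f"not an Ethernet MAC address: {mac!r}")
    return ["devlink", "port", "function", "set",
            port_handle, "hw_addr", mac.lower()]


# Hypothetical port handle and address.
argv = hw_addr_set_argv("pci/0000:06:00.0/2", "00:11:22:33:44:55")
```

The resulting list can be handed to ``subprocess.run()``. Per the ordering
rules described in this section, that should happen before the function
device is bound to a driver or the port function is activated.
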
Subfunction
============

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. Subfunctions are created and deployed one at a time.
Unlike SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs the setup on the subfunction management device.

(1) Create
----------
A subfunction is created using the devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and the
representor netdevice.

(2) Configure
-------------
At this point the subfunction devlink port is created but not yet active. That
means the entities are created on the devlink side and the e-switch port
representor is created, but the subfunction device itself is not created. A
user might use the e-switch port representor to apply settings, put it into a
bridge, add TC rules, etc. A user might also configure the hardware address
(such as the MAC address) of the subfunction while the subfunction is inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

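The create, configure and deploy sequence can be pictured as a small state
model. The Python sketch below is only an illustration of the ordering
described above: the ``Subfunction`` class and its methods are hypothetical,
and rejecting a hardware address change after activation is this sketch's
reading of the text rather than documented driver behaviour.

```python
from typing import Optional


class Subfunction:
    """Hypothetical model of a subfunction port as seen through devlink."""

    def __init__(self, pfnum: int, sfnum: int):
        # (1) Create: the devlink port and its representor exist,
        # but no subfunction device has been instantiated yet.
        self.pfnum = pfnum
        self.sfnum = sfnum
        self.hw_addr: Optional[str] = None
        self.active = False
        self.device: Optional[str] = None

    def set_hw_addr(self, mac: str) -> None:
        # (2) Configure: attributes are set while the function is inactive.
        if self.active:
            raise RuntimeError("configure the function before activating it")
        self.hw_addr = mac

    def activate(self) -> str:
        # (3) Deploy: the subfunction device appears (on the auxiliary bus
        # in the real system) and a subfunction driver can bind to it.
        self.active = True
        self.device = f"auxiliary device for pf{self.pfnum}sf{self.sfnum}"
        return self.device


sf = Subfunction(pfnum=0, sfnum=88)   # create
sf.set_hw_addr("00:00:00:00:88:88")   # configure
device = sf.activate()                # deploy
```
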
Rate object management
======================

Devlink provides an API to manage the TX rates of a single devlink port or of
a group of ports. This is done through rate objects, which can be one of two
types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a one-to-one mapping to its devlink port, in userspace it is
  referred to as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leaves and/or nodes); created/deleted by
  request from userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier except a decimal number, to avoid
  collisions with leaves.

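Because a node name may not be a decimal number, the last component of a rate
object handle is enough to tell the two types apart. The following sketch
shows that rule; the function name ``rate_object_kind()`` and the example
handles are hypothetical.

```python
def rate_object_kind(handle: str) -> str:
    """Classify a '<bus>/<bus_addr>/<name>' rate object handle."""
    # Raises ValueError if the handle does not have three components.
    _bus, _bus_addr, name = handle.split("/", 2)
    if not name:
        raise ValueError(f"missing port index or node name: {handle!r}")
    # A purely decimal last component is a port index, hence a leaf.
    if name.isascii() and name.isdigit():
        return "leaf"
    return "node"


kinds = {
    "pci/0000:06:00.0/1": rate_object_kind("pci/0000:06:00.0/1"),
    "pci/0000:06:00.0/group1": rate_object_kind("pci/0000:06:00.0/group1"),
}
```
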
The API allows the following rate object parameters to be configured:

``tx_share``
  Minimum TX rate value shared among all other rate objects or, if the object
  is part of a group, among the rate objects that are part of the same parent
  group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with
  strict priority. A node configured with a higher weight gets more
  bandwidth (BW) relative to its siblings. Values are relative, like
  percentage points: they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.

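One way to read the ``parent`` rule is that the effective upper limit of a
rate object is the smallest ``tx_max`` found on the object itself and on its
chain of parent nodes. The sketch below illustrates that reading only. The
``RateObject`` class, the ``effective_tx_max()`` helper and the use of plain
integers in Mbit/s are assumptions, and actual enforcement is up to the
device and driver.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RateObject:
    """Hypothetical leaf or node; tx_max is in Mbit/s, None means no limit."""
    name: str
    tx_max: Optional[int] = None
    parent: Optional["RateObject"] = None


def effective_tx_max(obj: Optional[RateObject]) -> Optional[int]:
    """Smallest tx_max along the path from the object up to the root."""
    limits = []
    while obj is not None:
        if obj.tx_max is not None:
            limits.append(obj.tx_max)
        obj = obj.parent
    return min(limits) if limits else None


group = RateObject("group1", tx_max=100)
fast_leaf = RateObject("1", tx_max=150, parent=group)
slow_leaf = RateObject("2", tx_max=40, parent=group)

fast_limit = effective_tx_max(fast_leaf)   # capped by the parent node
slow_limit = effective_tx_max(slow_leaf)   # the leaf's own limit is tighter
```
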
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow at a high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.

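The arbitration flow above can be sketched in a few lines of Python. This is
a simplified model under stated assumptions, not how any particular device
schedules traffic: a larger ``tx_priority`` value is taken to mean higher
priority, ``tx_max`` is modelled as a packet budget, WFQ is approximated by
picking the sibling with the smallest ``sent / tx_weight`` ratio, and the
``RateNode``, ``pick()`` and ``arbitrate()`` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RateNode:
    name: str
    tx_priority: int = 0          # assumption: larger value = higher priority
    tx_weight: int = 1            # relative weight among same-priority siblings
    tx_max: float = float("inf")  # budget in abstract packets for this sketch
    sent: int = 0                 # packets already charged to this object
    children: List["RateNode"] = field(default_factory=list)

    def eligible(self) -> bool:
        """Within its limit and, for a node, has at least one eligible child."""
        if self.sent >= self.tx_max:
            return False
        return not self.children or any(c.eligible() for c in self.children)


def pick(siblings: List[RateNode]) -> Optional[RateNode]:
    """Strict priority first, then WFQ inside the winning priority subgroup."""
    candidates = [n for n in siblings if n.eligible()]
    if not candidates:
        return None
    top = max(n.tx_priority for n in candidates)
    subgroup = [n for n in candidates if n.tx_priority == top]
    # The sibling that has received the least service per unit of weight wins.
    return min(subgroup, key=lambda n: n.sent / n.tx_weight)


def arbitrate(siblings: List[RateNode]) -> Optional[str]:
    """Walk down the tree until a leaf wins; charge one packet along the path."""
    path = []
    level = siblings
    while level:
        winner = pick(level)
        if winner is None:
            return None
        path.append(winner)
        level = winner.children
    for node in path:
        node.sent += 1
    return path[-1].name if path else None


ports = [
    RateNode("hi", tx_priority=1, tx_max=2),
    RateNode("a", tx_weight=3),
    RateNode("b", tx_weight=1),
]
order = [arbitrate(ports) for _ in range(6)]
```

In this run ``hi`` wins until its budget of two packets is used up, after
which ``a`` and ``b`` share the remaining four rounds in a 3:1 ratio that
follows their weights.
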
Driver implementations are allowed to support either or both rate object
types, and any of the methods for setting their parameters. Additionally, a
driver implementation may export nodes/leaves and their child-parent
relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device that has one or more PCI buses and consists of
       one or more PCI controllers.
   * - ``PCI controller``
     -  A controller consists of potentially multiple physical functions,
        virtual functions and subfunctions.
   * - ``Port function``
     -  An object to manage the function of a port.
   * - ``Subfunction``
     -  A lightweight function that has a parent PCI function on which it is
        deployed.
   * - ``Subfunction device``
     -  A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     -  A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     -  A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     -  A device driver for a PCI physical function that supports
        subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     -  A device driver for a PCI physical function that hosts subfunction
        devices. In most cases it is the same as the subfunction management
        driver. When a subfunction is used on an external controller, the
        subfunction management and host drivers are different.
