.. SPDX-License-Identifier: GPL-2.0

===================
ice devlink support
===================

This document describes the devlink features implemented by the ``ice``
device driver.

Info versions
=============

The ``ice`` driver reports the following versions

.. list-table:: devlink info versions implemented
    :widths: 5 5 5 90

    * - Name
      - Type
      - Example
      - Description
    * - ``board.id``
      - fixed
      - K65390-000
      - The Product Board Assembly (PBA) identifier of the board.
    * - ``fw.mgmt``
      - running
      - 2.1.7
      - 3-digit version number of the management firmware running on the
        Embedded Management Processor of the device. It controls the PHY,
        link, access to device resources, etc. Intel documentation refers to
        this as the EMP firmware.
    * - ``fw.mgmt.api``
      - running
      - 1.5.1
      - 3-digit version number (major.minor.patch) of the API exported over
        the AdminQ by the management firmware. Used by the driver to
        identify what commands are supported. Historical versions of the
        kernel only displayed a 2-digit version number (major.minor).
    * - ``fw.mgmt.build``
      - running
      - 0x305d955f
      - Unique identifier of the source for the management firmware.
    * - ``fw.undi``
      - running
      - 1.2581.0
      - Version of the Option ROM containing the UEFI driver. The version is
        reported in ``major.minor.patch`` format. The major version is
        incremented whenever a major breaking change occurs, or when the
        minor version would overflow. The minor version is incremented for
        non-breaking changes and reset to 1 when the major version is
        incremented. The patch version is normally 0 but is incremented when
        a fix is delivered as a patch against an older base Option ROM.
    * - ``fw.psid.api``
      - running
      - 0.80
      - Version defining the format of the flash contents.
    * - ``fw.bundle_id``
      - running
      - 0x80002ec0
      - Unique identifier of the firmware image file that was loaded onto
        the device. Also referred to as the EETRACK identifier of the NVM.
    * - ``fw.app.name``
      - running
      - ICE OS Default Package
      - The name of the DDP package that is active in the device. The DDP
        package is loaded by the driver during initialization. Each
        variation of the DDP package has a unique name.
    * - ``fw.app``
      - running
      - 1.3.1.0
      - The version of the DDP package that is active in the device. Note
        that both the name (as reported by ``fw.app.name``) and version are
        required to uniquely identify the package.
    * - ``fw.app.bundle_id``
      - running
      - 0xc0000001
      - Unique identifier for the DDP package loaded in the device. Also
        referred to as the DDP Track ID. Can be used to uniquely identify
        the specific DDP package.
    * - ``fw.netlist``
      - running
      - 1.1.2000-6.7.0
      - The version of the netlist module. This module defines the device's
        Ethernet capabilities and default settings, and is used by the
        management firmware as part of managing link and device
        connectivity.
    * - ``fw.netlist.build``
      - running
      - 0xee16ced7
      - The first 4 bytes of the hash of the netlist module contents.
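
Because ``fw.mgmt.api`` may appear as ``major.minor`` on historical kernels
and as ``major.minor.patch`` on current ones, a tool that compares these
strings has to accept both forms. The following is a minimal sketch of such a
parser. The function name is hypothetical, and treating a missing patch
component as 0 is an assumption made for illustration, not something the
driver defines.

```python
def parse_fw_api_version(text):
    """Parse a dotted version string into a (major, minor, patch) tuple.

    Accepts both "major.minor" and "major.minor.patch". A missing patch
    component is treated as 0, which is an assumption of this sketch.
    """
    parts = [int(p) for p in text.strip().split(".")]
    if len(parts) == 2:
        parts.append(0)
    if len(parts) != 3:
        raise ValueError(f"unexpected version format: {text!r}")
    return tuple(parts)


# Tuples compare component by component, so ordering works as expected.
print(parse_fw_api_version("1.5.1"))
print(parse_fw_api_version("1.5") < parse_fw_api_version("1.5.1"))
```

Comparing tuples rather than raw strings avoids mis-ordering values such as
``1.10.0`` and ``1.9.0``.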
92
93Flash Update
94============
95
96The ``ice`` driver implements support for flash update using the
97``devlink-flash`` interface. It supports updating the device flash using a
98combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
99``fw.netlist`` components.
100
101.. list-table:: List of supported overwrite modes
102   :widths: 5 95
103
104   * - Bits
105     - Behavior
106   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
107     - Do not preserve settings stored in the flash components being
108       updated. This includes overwriting the port configuration that
109       determines the number of physical functions the device will
110       initialize with.
111   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
112     - Do not preserve either settings or identifiers. Overwrite everything
113       in the flash with the contents from the provided image, without
114       performing any preservation. This includes overwriting device
115       identifying fields such as the MAC address, VPD area, and device
116       serial number. It is expected that this combination be used with an
117       image customized for the specific device.
118
119The ice hardware does not support overwriting only identifiers while
120preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
121own will be rejected. If no overwrite mask is provided, the firmware will be
122instructed to preserve all settings and identifying fields when updating.
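
The accepted and rejected overwrite masks can be summarized with a small
model. This is only an illustrative sketch and not driver code: the function
name is hypothetical, and the numeric bit values are assumed here purely so
the example is self-contained.

```python
# Stand-ins for DEVLINK_FLASH_OVERWRITE_SETTINGS and
# DEVLINK_FLASH_OVERWRITE_IDENTIFIERS. The numeric values are assumptions
# of this sketch.
OVERWRITE_SETTINGS = 1 << 0
OVERWRITE_IDENTIFIERS = 1 << 1


def describe_overwrite_mask(mask):
    """Return the preservation behavior for a mask, or raise if rejected."""
    if mask == 0:
        return "preserve settings and identifiers"
    if mask == OVERWRITE_SETTINGS:
        return "overwrite settings, preserve identifiers"
    if mask == OVERWRITE_SETTINGS | OVERWRITE_IDENTIFIERS:
        return "overwrite settings and identifiers"
    # Identifiers alone (or any unknown bit) is not supported.
    raise ValueError(f"unsupported overwrite mask: {mask:#x}")


print(describe_overwrite_mask(0))
print(describe_overwrite_mask(OVERWRITE_SETTINGS | OVERWRITE_IDENTIFIERS))
```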
123
124Reload
125======
126
127The ``ice`` driver supports activating new firmware after a flash update
128using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
129action.
130
131.. code:: shell
132
133    $ devlink dev reload pci/0000:01:00.0 reload action fw_activate
134
135The new firmware is activated by issuing a device specific Embedded
136Management Processor reset which requests the device to reset and reload the
137EMP firmware image.
138
139The driver does not currently support reloading the driver via
140``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.
141
142Port split
143==========
144
145The ``ice`` driver supports port splitting only for port 0, as the FW has
146a predefined set of available port split options for the whole device.
147
148A system reboot is required for port split to be applied.
149
150The following command will select the port split option with 4 ports:
151
152.. code:: shell
153
154    $ devlink port split pci/0000:16:00.0/0 count 4
155
156The list of all available port options will be printed to dynamic debug after
157each ``split`` and ``unsplit`` command. The first option is the default.
158
159.. code:: shell
160
161    ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
162    ice 0000:16:00.0: Status  Split      Quad 0          Quad 1
163    ice 0000:16:00.0:         count  L0  L1  L2  L3  L4  L5  L6  L7
164    ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
165    ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
166    ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -
167    ice 0000:16:00.0:         4      25  25   -   -  25  25   -   -
168    ice 0000:16:00.0:         8      10  10  10  10  10  10  10  10
169    ice 0000:16:00.0:         1     100   -   -   -   -   -   -   -
170
171There could be multiple FW port options with the same port split count. When
172the same port split count request is issued again, the next FW port option with
173the same port split count will be selected.
174
175``devlink port unsplit`` will select the option with a split count of 1. If
176there is no FW option available with split count 1, you will receive an error.
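
The selection behavior described above can be illustrated with a simplified
model. The sketch below is not the driver's algorithm: it assumes that "next"
means the next option after the currently selected one in table order,
wrapping around at the end. The function name is hypothetical, and the list
of counts is taken from the debug output shown earlier.

```python
# Split counts of the FW port options, in the order of the debug table.
COUNTS = [2, 2, 4, 4, 8, 1]


def select_port_option(counts, current, requested):
    """Return the index of the next option whose split count matches.

    The search starts just after `current` and wraps around, so repeating
    a request cycles through every option with the same count.
    """
    n = len(counts)
    for step in range(1, n + 1):
        idx = (current + step) % n
        if counts[idx] == requested:
            return idx
    raise ValueError(f"no port option with split count {requested}")


first = select_port_option(COUNTS, 0, 4)       # from the active option
second = select_port_option(COUNTS, first, 4)  # same request again
unsplit = select_port_option(COUNTS, second, 1)
print(first, second, unsplit)
```

In this model an unsplit request is simply a request for a split count of 1,
and a count that is missing from the table raises an error.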
177
178Regions
179=======
180
181The ``ice`` driver implements the following regions for accessing internal
182device data.
183
184.. list-table:: regions implemented
185    :widths: 15 85
186
187    * - Name
188      - Description
189    * - ``nvm-flash``
190      - The contents of the entire flash chip, sometimes referred to as
191        the device's Non Volatile Memory.
192    * - ``shadow-ram``
193      - The contents of the Shadow RAM, which is loaded from the beginning
194        of the flash. Although the contents are primarily from the flash,
195        this area also contains data generated during device boot which is
196        not stored in flash.
197    * - ``device-caps``
198      - The contents of the device firmware's capabilities buffer. Useful to
199        determine the current state and configuration of the device.
200
201Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
202snapshot. The ``device-caps`` region requires a snapshot as the contents are
203sent by firmware and can't be split into separate reads.
204
205Users can request an immediate capture of a snapshot for all three regions
206via the ``DEVLINK_CMD_REGION_NEW`` command.
207
208.. code:: shell
209
210    $ devlink region show
211    pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
212    pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10
213
214    $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1
215    $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
216
217    $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
218    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
219    0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
220    0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
221    0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
222
223    $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
224    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
225
226    $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1
227
228    $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
229    $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
230    0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
231    0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
232    0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
233    0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
234    0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
235    0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
236    0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
237    0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
238    0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
239    0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
240    00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
241    00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
242    00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
243    00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
244    00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
245    00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
246    0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
247    0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
248    0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
249    0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
250    0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
251    0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
252    0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
253    0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
254    0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
255    0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
256    00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
257    00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
258    00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
259    00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
260    00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
261    00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
262    0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
263    0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
264
265    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
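
When post-processing dumps like the ones above, it can help to turn each text
line back into an address and raw bytes. The following is a minimal sketch
that handles both the 4-digit and the 2-digit hex groupings shown in the
examples. The function name is hypothetical, and the bytes are taken in
printed order, which is an assumption: any byte swapping inside multi-byte
groups is not modeled.

```python
def parse_dump_line(line):
    """Split one region dump line into (address, payload bytes)."""
    addr_text, *groups = line.split()
    # Join the hex groups in printed order and decode them as raw bytes.
    return int(addr_text, 16), bytes.fromhex("".join(groups))


addr, data = parse_dump_line(
    "0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8"
)
print(hex(addr), len(data))
```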
266
267Devlink Rate
268============
269
270The ``ice`` driver implements devlink-rate API. It allows for offload of
271the Hierarchical QoS to the hardware. It enables user to group Virtual
272Functions in a tree structure and assign supported parameters: tx_share,
273tx_max, tx_priority and tx_weight to each node in a tree. So effectively
274user gains an ability to control how much bandwidth is allocated for each
275VF group. This is later enforced by the HW.
276
277It is assumed that this feature is mutually exclusive with DCB performed
278in FW and ADQ, or any driver feature that would trigger changes in QoS,
279for example creation of the new traffic class. The driver will prevent DCB
280or ADQ configuration if user started making any changes to the nodes using
281devlink-rate API. To configure those features a driver reload is necessary.
282Correspondingly if ADQ or DCB will get configured the driver won't export
283hierarchy at all, or will remove the untouched hierarchy if those
284features are enabled after the hierarchy is exported, but before any
285changes are made.
286
287This feature is also dependent on switchdev being enabled in the system.
288It's required because devlink-rate requires devlink-port objects to be
289present, and those objects are only created in switchdev mode.
290
291If the driver is set to the switchdev mode, it will export internal
292hierarchy the moment VF's are created. Root of the tree is always
293represented by the node_0. This node can't be deleted by the user. Leaf
294nodes and nodes with children also can't be deleted.
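
The deletion rules above can be restated as a small check over a parent map.
This is an illustrative sketch only. The function name and the tree contents
are hypothetical (``node_empty`` in particular is invented for the example),
and the real checks are performed by the driver.

```python
def can_delete(name, parent_of, leaves):
    """Return True if a rate node may be deleted under the stated rules."""
    if name == "node_0":
        return False  # the root can never be deleted
    if name in leaves:
        return False  # leaf nodes can't be deleted
    # A node that is the parent of any other entry still has children.
    return all(parent != name for parent in parent_of.values())


# Hypothetical tree: child name -> parent name.
parent_of = {
    "node_custom": "node_0",
    "node_custom_1": "node_custom",
    "node_empty": "node_0",
    "2": "node_custom_1",
}
leaves = {"2"}

for node in ("node_0", "node_custom", "node_empty", "2"):
    print(node, can_delete(node, parent_of, leaves))
```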
295
296.. list-table:: Attributes supported
297    :widths: 15 85
298
299    * - Name
300      - Description
301    * - ``tx_max``
302      - maximum bandwidth to be consumed by the tree Node. Rate Limit is
303        an absolute number specifying a maximum amount of bytes a Node may
304        consume during the course of one second. Rate limit guarantees
305        that a link will not oversaturate the receiver on the remote end
306        and also enforces an SLA between the subscriber and network
307        provider.
308    * - ``tx_share``
309      - minimum bandwidth allocated to a tree node when it is not blocked.
310        It specifies an absolute BW. While tx_max defines the maximum
311        bandwidth the node may consume, the tx_share marks committed BW
312        for the Node.
313    * - ``tx_priority``
314      - allows for usage of strict priority arbiter among siblings. This
315        arbitration scheme attempts to schedule nodes based on their
316        priority as long as the nodes remain within their bandwidth limit.
317        Range 0-7. Nodes with priority 7 have the highest priority and are
318        selected first, while nodes with priority 0 have the lowest
319        priority. Nodes that have the same priority are treated equally.
320    * - ``tx_weight``
321      - allows for usage of Weighted Fair Queuing arbitration scheme among
322        siblings. This arbitration scheme can be used simultaneously with
323        the strict priority. Range 1-200. Only relative values matter for
324        arbitration.
325
326``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
327nodes with the same priority form a WFQ subgroup in the sibling group
328and arbitration among them is based on assigned weights.
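
To make the combination of the two schemes concrete, the sketch below computes
the relative shares inside the highest-priority WFQ subgroup. It is a
deliberately simplified model and not how the hardware scheduler is
implemented: it assumes every sibling always has traffic queued and ignores
``tx_max`` and ``tx_share``. The function and node names are hypothetical.

```python
def wfq_shares(siblings):
    """Return {name: fraction} for the highest-priority sibling subgroup.

    `siblings` maps a node name to a (tx_priority, tx_weight) pair.
    """
    top = max(prio for prio, _ in siblings.values())
    group = {name: weight
             for name, (prio, weight) in siblings.items() if prio == top}
    total = sum(group.values())
    # Only relative weights matter, so normalize by the subgroup total.
    return {name: weight / total for name, weight in group.items()}


siblings = {"vf1": (7, 50), "vf2": (7, 150), "vf3": (0, 100)}
print(wfq_shares(siblings))
```

In this model ``vf3`` receives nothing while the priority 7 nodes are busy,
which mirrors strict priority. In practice lower-priority nodes are served
once higher-priority nodes are idle or have reached their bandwidth limit.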
329
330.. code:: shell
331
332    # enable switchdev
333    $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev
334
335    # at this point driver should export internal hierarchy
336    $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs
337
338    $ devlink port function rate show
339    pci/0000:4b:00.0/node_25: type node parent node_24
340    pci/0000:4b:00.0/node_24: type node parent node_0
341    pci/0000:4b:00.0/node_32: type node parent node_31
342    pci/0000:4b:00.0/node_31: type node parent node_30
343    pci/0000:4b:00.0/node_30: type node parent node_16
344    pci/0000:4b:00.0/node_19: type node parent node_18
345    pci/0000:4b:00.0/node_18: type node parent node_17
346    pci/0000:4b:00.0/node_17: type node parent node_16
347    pci/0000:4b:00.0/node_14: type node parent node_5
348    pci/0000:4b:00.0/node_5: type node parent node_3
349    pci/0000:4b:00.0/node_13: type node parent node_4
350    pci/0000:4b:00.0/node_12: type node parent node_4
351    pci/0000:4b:00.0/node_11: type node parent node_4
352    pci/0000:4b:00.0/node_10: type node parent node_4
353    pci/0000:4b:00.0/node_9: type node parent node_4
354    pci/0000:4b:00.0/node_8: type node parent node_4
355    pci/0000:4b:00.0/node_7: type node parent node_4
356    pci/0000:4b:00.0/node_6: type node parent node_4
357    pci/0000:4b:00.0/node_4: type node parent node_3
358    pci/0000:4b:00.0/node_3: type node parent node_16
359    pci/0000:4b:00.0/node_16: type node parent node_15
360    pci/0000:4b:00.0/node_15: type node parent node_0
361    pci/0000:4b:00.0/node_2: type node parent node_1
362    pci/0000:4b:00.0/node_1: type node parent node_0
363    pci/0000:4b:00.0/node_0: type node
364    pci/0000:4b:00.0/1: type leaf parent node_25
365    pci/0000:4b:00.0/2: type leaf parent node_25
366
367    # let's create some custom node
368    $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
369
370    # second custom node
371    $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom
372
373    # reassign second VF to newly created branch
374    $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1
375
376    # assign tx_weight to the VF
377    $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5
378
379    # assign tx_share to the VF
380    $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
381