.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is targeted at real-time alerting, so that
the user knows when something bad has happened to a PCI device. It aims to:

  * Provide debug information for alerts.
  * Enable self-healing.
  * Provide a way to gather all needed debugging information when a problem
    needs vendor support.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx
error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter",
e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * OOB initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.
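
The reporter model described above can be sketched in user space. The
following Python snippet is only an illustrative model, not kernel code:
``ReporterOps``, ``DevlinkInstance``, ``reporter_create`` and
``supported_ops`` are hypothetical names chosen for this example, and the
values returned by the callbacks are made up.

.. code-block:: python

   from dataclasses import dataclass, field
   from typing import Callable, Dict, List, Optional


   @dataclass
   class ReporterOps:
       """Optional driver callbacks for one reporter (hypothetical names)."""
       name: str
       recover: Optional[Callable[[], bool]] = None
       diagnose: Optional[Callable[[], dict]] = None
       dump: Optional[Callable[[], dict]] = None


   @dataclass
   class DevlinkInstance:
       """Toy stand-in for the generic devlink instance of one device."""
       reporters: Dict[str, ReporterOps] = field(default_factory=dict)

       def reporter_create(self, ops: ReporterOps) -> ReporterOps:
           # Register one reporter per error/health type.
           self.reporters[ops.name] = ops
           return ops

       def supported_ops(self, name: str) -> List[str]:
           # Only the callbacks the driver provided are available.
           ops = self.reporters[name]
           return [op for op in ("recover", "diagnose", "dump")
                   if getattr(ops, op) is not None]


   devlink = DevlinkInstance()
   # One part of the driver registers a reporter with recovery and diagnostics.
   devlink.reporter_create(ReporterOps(
       name="tx",
       recover=lambda: True,
       diagnose=lambda: {"stuck_queues": 0},
   ))
   # Another part registers a reporter that only knows how to dump.
   devlink.reporter_create(ReporterOps(name="fw", dump=lambda: {"trace": []}))

In this model the "tx" reporter offers recovery and diagnostics while the
"fw" reporter offers only a dump, which mirrors the idea that each reporter
carries its own set of handlers.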

Actions
=======

Once an error is reported, devlink health performs the following actions:

  * A log is sent to the kernel trace events buffer.
  * Health status and statistics are updated for the reporter instance.
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored).
  * An auto-recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery
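
The sequence above can be modelled in a few lines of user-space Python. This
is a simplified sketch that follows the list in this section rather than the
kernel source, so the real implementation may apply additional checks. The
``Reporter`` class, the ``health_report`` function and all field names are
assumptions made for this example, and time is passed in as plain
milliseconds.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Callable, List, Optional


   @dataclass
   class Reporter:
       recover_cb: Callable[[], bool]   # driver recovery procedure
       dump_cb: Callable[[], dict]      # driver object dump procedure
       auto_recover: bool = True
       grace_period_ms: int = 0
       error_count: int = 0
       recover_count: int = 0
       healthy: bool = True
       last_recovery_ms: Optional[int] = None
       stored_dump: Optional[dict] = None


   def health_report(rep: Reporter, msg: str, now_ms: int,
                     trace: List[str]) -> bool:
       """Handle one error report; return True if recovery was attempted."""
       trace.append(msg)                 # log to the trace buffer
       rep.error_count += 1              # update status and statistics
       rep.healthy = False
       if rep.stored_dump is None:       # keep only the first dump
           rep.stored_dump = rep.dump_cb()
       if not rep.auto_recover:          # auto-recovery is disabled
           return False
       in_grace = (rep.last_recovery_ms is not None
                   and now_ms - rep.last_recovery_ms < rep.grace_period_ms)
       if in_grace:                      # too soon after the last recovery
           return False
       rep.last_recovery_ms = now_ms
       rep.recover_count += 1
       rep.healthy = rep.recover_cb()
       return True


   trace: List[str] = []
   rep = Reporter(recover_cb=lambda: True, dump_cb=lambda: {"regs": [0, 1]},
                  grace_period_ms=500)
   first = health_report(rep, "tx timeout", now_ms=1000, trace=trace)
   second = health_report(rep, "tx timeout", now_ms=1200, trace=trace)
   third = health_report(rep, "tx timeout", now_ms=1600, trace=trace)

In this run the second report arrives 200 ms after a recovery, inside the
500 ms grace period, so it is logged and counted but no recovery is
attempted. The third report arrives after the grace period has elapsed and
triggers recovery again.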

User Interface
==============

The user can access/change each reporter's parameters and driver-specific
callbacks via ``devlink``, e.g. per error type (per health reporter):

  * Configure the reporter's generic parameters (e.g. disable/enable auto
    recovery)
  * Invoke the recovery procedure
  * Run diagnostics
  * Retrieve an object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per device and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Sets reporter-related configuration.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers the reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should closely follow those of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
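
The single-dump behaviour of ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` and
``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be illustrated with a small
user-space model. This is a sketch only: ``DumpStore``, ``dump_get`` and
``dump_clear`` are hypothetical names, and the model assumes that a dump
generated on demand is also kept as the stored dump, which the table does not
state explicitly.

.. code-block:: python

   from typing import Callable, List, Optional


   class DumpStore:
       """Single-slot dump storage for one reporter (illustrative only)."""

       def __init__(self, dump_cb: Callable[[], dict]) -> None:
           self._dump_cb = dump_cb
           self._stored: Optional[dict] = None

       def dump_get(self) -> dict:
           # Return the stored dump, generating one only if none exists.
           if self._stored is None:
               self._stored = self._dump_cb()
           return self._stored

       def dump_clear(self) -> None:
           # Drop the stored dump so the next dump_get() generates a new one.
           self._stored = None


   calls: List[int] = []

   def driver_dump() -> dict:
       # Stand-in for a reporter-defined dump; counts how often it runs.
       calls.append(1)
       return {"snapshot": len(calls)}

   store = DumpStore(driver_dump)
   a = store.dump_get()
   b = store.dump_get()
   store.dump_clear()
   c = store.dump_get()

Here the second ``dump_get()`` returns the dump that is already stored
without calling the driver again, and a new dump is generated only after
``dump_clear()``.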

The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+