.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism targets real-time alerting: it lets the
user know when something bad has happened to a PCI device. Its goals are to:

  * Provide alert debug information.
  * Allow self healing.
  * If a problem needs vendor support, provide a way to gather all needed
    debugging information.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error,
rx/tx error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * OOB initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.

Actions
=======

Once an error is reported, devlink health performs the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time elapsed since the last recovery

User Interface
==============

The user can access or change each reporter's parameters and invoke its
driver specific callbacks via ``devlink``, e.g. per error type (per health
reporter):

  * Configure the reporter's generic parameters (like: disable/enable auto recovery)
  * Invoke the recovery procedure
  * Run diagnostics
  * Get the object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers the reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should closely follow those of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.

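
The single-dump behaviour of ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` and
``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be illustrated with a small
userspace model. This is only a sketch, not kernel code: ``struct
model_reporter`` and the ``model_*`` functions are hypothetical names invented
for the illustration, and the dump itself is reduced to an integer id.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical model of one reporter's dump slot (not a kernel type). */
    struct model_reporter {
        bool has_dump;        /* true while a dump is stored */
        unsigned int dump_id; /* identifies the stored dump */
        unsigned int next_id; /* counts dumps generated so far */
    };

    /* Stand-in for the driver's object dump procedure. */
    void model_generate_dump(struct model_reporter *r)
    {
        assert(r != NULL);
        r->dump_id = ++r->next_id;
        r->has_dump = true;
    }

    /* Models DUMP_GET: reuse the stored dump, generate one only if none exists. */
    unsigned int model_dump_get(struct model_reporter *r)
    {
        assert(r != NULL);
        if (!r->has_dump)
            model_generate_dump(r);
        return r->dump_id;
    }

    /* Models DUMP_CLEAR: drop the stored dump so the next get creates a new one. */
    void model_dump_clear(struct model_reporter *r)
    {
        assert(r != NULL);
        r->has_dump = false;
    }

In this model, two consecutive calls to ``model_dump_get()`` return the same
id, and a new id appears only after ``model_dump_clear()`` has been called.
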

The following diagram provides a general overview of ``devlink-health``::

                                                    netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+
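
The "health report" path through the health handler, together with the steps
listed under Actions, can be sketched as a userspace model. This is a
simplified illustration under stated assumptions, not the kernel
implementation: all names (``struct model_reporter``,
``model_health_report()``, the millisecond timestamps) are hypothetical, the
steps follow the order of the list in this document, and the ordering and
abort conditions in the kernel itself may differ.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical model of a reporter's state (not a kernel type). */
    struct model_reporter {
        bool auto_recover;         /* auto-recovery configuration */
        uint64_t grace_period_ms;  /* minimum time between auto recoveries */
        uint64_t last_recovery_ms; /* time of the last recovery */
        bool has_dump;             /* true while a dump is stored */
        unsigned int error_count;
        unsigned int recovery_count;
    };

    enum model_result { MODEL_NO_RECOVERY, MODEL_RECOVERED };

    enum model_result model_health_report(struct model_reporter *r,
                                          uint64_t now_ms)
    {
        assert(r != NULL);

        /* 1. A log would be sent to the trace events buffer (omitted here). */

        /* 2. Update the reporter's statistics. */
        r->error_count++;

        /* 3. Take a dump only if no other dump is already stored. */
        if (!r->has_dump)
            r->has_dump = true;

        /* 4. Attempt auto recovery, subject to configuration ... */
        if (!r->auto_recover)
            return MODEL_NO_RECOVERY;

        /* ... and to the grace period since the last recovery. */
        if (r->recovery_count > 0 &&
            now_ms - r->last_recovery_ms < r->grace_period_ms)
            return MODEL_NO_RECOVERY;

        r->recovery_count++;
        r->last_recovery_ms = now_ms;
        return MODEL_RECOVERED;
    }

With a grace period of 500 ms in this model, a report at 1000 ms triggers a
recovery, a second report at 1200 ms does not, and a third at 1600 ms does,
because more than the grace period has passed since the last recovery.
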