.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism targets real-time alerting, so that the user
knows when something bad has happened to a PCI device.

  * Provide alert debug information.
  * Self healing.
  * If a problem needs vendor support, provide a way to gather all needed
    debugging information.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error, rx/tx
error) or unknown (driver-specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All health report handling is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * OOB initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.

Actions
=======

Once an error is reported, devlink health performs the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto-recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery

User Interface
==============

The user can access and change each reporter's parameters and driver-specific
callbacks via ``devlink``, e.g. per error type (per health reporter):

  * Configure the reporter's generic parameters (like: disable/enable auto
    recovery)
  * Invoke the recovery procedure
  * Run diagnostics
  * Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers a reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves diagnostics data from a reporter on a device.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump for the specified reporter.
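
The report handling described under Actions, together with the dump get and
dump clear behaviour listed in the table, can be sketched as a small userland
model. This is only an illustration under stated assumptions: every name below
(``reporter_model``, ``model_health_report``, ``model_dump_get``,
``model_dump_clear``, ``model_recover_ok``) is hypothetical and is not kernel
API, time is passed in as plain milliseconds, and a dump is reduced to the
timestamp at which it was taken. The real devlink code may order or gate these
steps differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userland model; none of these names are kernel API. */
enum model_state { MODEL_STATE_HEALTHY, MODEL_STATE_ERROR };

struct reporter_model {
	const char *name;
	bool auto_recover;         /* auto-recovery configuration */
	uint64_t grace_period_ms;  /* minimum spacing between auto-recoveries */
	uint64_t last_recovery_ms; /* meaningful only if recover_count > 0 */
	uint64_t error_count;
	uint64_t recover_count;
	bool dump_stored;          /* a single dump is kept per reporter */
	uint64_t dump_ts_ms;       /* when the stored dump was taken */
	enum model_state state;
	int (*recover)(struct reporter_model *r); /* driver callback, may be NULL */
};

/* Example driver callback that always succeeds. */
static int model_recover_ok(struct reporter_model *r)
{
	(void)r;
	return 0;
}

/* Handle one error report; returns true if auto-recovery was attempted. */
static bool model_health_report(struct reporter_model *r, const char *msg,
				uint64_t now_ms)
{
	assert(r && msg);

	/* 1. Log the event (stderr stands in for the trace events buffer). */
	fprintf(stderr, "health report: %s: %s\n", r->name, msg);

	/* 2. Update health status and statistics. */
	r->state = MODEL_STATE_ERROR;
	r->error_count++;

	/* 3. Take a dump only if no dump is already stored. */
	if (!r->dump_stored) {
		r->dump_stored = true;
		r->dump_ts_ms = now_ms;
	}

	/* 4. Auto-recovery depends on configuration and the grace period. */
	if (!r->auto_recover || !r->recover)
		return false;
	if (r->recover_count > 0 &&
	    now_ms - r->last_recovery_ms < r->grace_period_ms)
		return false;

	r->last_recovery_ms = now_ms;
	r->recover_count++;
	if (r->recover(r) == 0)
		r->state = MODEL_STATE_HEALTHY;
	return true;
}

/* Dump get: return the stored dump, generating one if none is stored. */
static uint64_t model_dump_get(struct reporter_model *r, uint64_t now_ms)
{
	assert(r);
	if (!r->dump_stored) {
		r->dump_stored = true;
		r->dump_ts_ms = now_ms;
	}
	return r->dump_ts_ms;
}

/* Dump clear: drop the stored dump so that a new one can be taken. */
static void model_dump_clear(struct reporter_model *r)
{
	assert(r);
	r->dump_stored = false;
}
```

In this model, a second report that arrives inside the grace period still
updates the statistics and leaves the reporter in the error state, but it
neither replaces the stored dump nor triggers another recovery attempt.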

The following diagram provides a general overview of ``devlink-health``::

                                                    netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
     mlx5_core                              devlink    |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+
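
The request routing shown in the diagram can be illustrated with a minimal
userland sketch: the driver hands a set of callbacks to the central layer when
it creates a reporter, and both user requests (diagnose, recover, dump) and the
health handler (recover, dump) are routed to those callbacks. All names here
(``sketch_ops``, ``sketch_reporter``, ``sketch_dispatch``,
``sketch_health_report``, ``sketch_driver``) are hypothetical and are not the
devlink API. Returning ``-EOPNOTSUPP`` for a missing callback and running the
dump before the recovery are assumptions of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the devlink API. */
enum sketch_cmd { SKETCH_CMD_RECOVER, SKETCH_CMD_DIAGNOSE, SKETCH_CMD_DUMP };

/* Callbacks the driver supplies when it creates a reporter. */
struct sketch_ops {
	int (*recover)(void *priv);
	int (*diagnose)(void *priv);
	int (*dump)(void *priv);
};

/* Reporter instance held by the central layer ("devlink" in the diagram). */
struct sketch_reporter {
	const struct sketch_ops *ops;
	void *priv; /* driver context passed back to every callback */
};

/* "request for ops" -> "ops execution": route a request to the driver. */
static int sketch_dispatch(const struct sketch_reporter *rep,
			   enum sketch_cmd cmd)
{
	int (*op)(void *priv) = NULL;

	assert(rep && rep->ops);
	switch (cmd) {
	case SKETCH_CMD_RECOVER:
		op = rep->ops->recover;
		break;
	case SKETCH_CMD_DIAGNOSE:
		op = rep->ops->diagnose;
		break;
	case SKETCH_CMD_DUMP:
		op = rep->ops->dump;
		break;
	}
	/* Assumption of this sketch: a missing callback is "not supported". */
	return op ? op(rep->priv) : -EOPNOTSUPP;
}

/* "health report": the handler requests only dump and recover. */
static int sketch_health_report(const struct sketch_reporter *rep)
{
	int err = sketch_dispatch(rep, SKETCH_CMD_DUMP);

	if (err && err != -EOPNOTSUPP)
		return err;
	return sketch_dispatch(rep, SKETCH_CMD_RECOVER);
}

/* Example driver side: counts how often each callback ran. */
struct sketch_driver {
	int recoveries;
	int dumps;
};

static int sketch_drv_recover(void *priv)
{
	((struct sketch_driver *)priv)->recoveries++;
	return 0;
}

static int sketch_drv_dump(void *priv)
{
	((struct sketch_driver *)priv)->dumps++;
	return 0;
}

/* This example driver provides no diagnose callback. */
static const struct sketch_ops sketch_drv_ops = {
	.recover = sketch_drv_recover,
	.dump = sketch_drv_dump,
};
```

Keeping the callbacks in a constant table and the driver context in a separate
pointer lets different parts of a driver register different reporters with
different handlers while sharing one dispatch path.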