# Hardware Fault Monitor

Author: Claire Weinan <cweinan@google.com> `daylight22`

Other contributors:

- Heinz Boehmer Fiehn <heinzboehmer@google.com>
- Drew Walton <acwalton@google.com>

Created: Aug 5, 2021

## Problem Description

The goal is to create a new hardware fault monitor which will provide a
framework for collecting various fault and sensor information and making it
available externally via Redfish for data center monitoring and management
purposes. The information logged would include a wide variety of chipset
registers and data from manageability hardware. In addition to collecting
information through BMC interfaces, the hardware fault monitor will also
receive information via Redfish from the associated host kernel (specifically
for cases in which the desired information cannot be collected directly by the
BMC, for example when accessing registers that are read and cleared by the host
kernel).

Future expansion of the hardware fault monitor would include adding the means
to locally analyze fault and sensor information and then based on specified
criteria trigger repair actions in the host BIOS or kernel. In addition, the
hardware fault monitor could receive repair action requests via Redfish from
external data center monitoring software.

## Background and References

The following are a few related existing OpenBMC modules:

- Host Error Monitor logs CPU error information such as CATERR details and
  takes appropriate actions such as performing resets and collecting
  crashdumps: <https://github.com/openbmc/host-error-monitor>

- bmcweb implements a Redfish webserver for OpenBMC:
  <https://github.com/openbmc/bmcweb>. The Redfish LogService schema is
  available for logging purposes and the EventService schema is available for a
  Redfish server to send event notifications to clients.

- Phosphor Debug Collector (phosphor-debug-collector) collects various debug
  dumps and saves them into files:
  <https://github.com/openbmc/phosphor-debug-collector>

- dbus-sensors reads and saves sensor values and makes them available to other
  modules via D-Bus: <https://github.com/openbmc/dbus-sensors>

- SEL logger logs to the IPMI and Redfish system event logs when certain events
  happen, such as sensor readings going beyond their thresholds:
  <https://github.com/openbmc/phosphor-sel-logger>

- FRU fault manager controls the blinking of LEDs when faults occur:
  <https://github.com/openbmc/phosphor-led-manager/blob/master/fault-monitor/fru-fault-monitor.hpp>

- Guard On BMC records and manages a list of faulty components for isolation.
  (Both the host and the BMC may identify faulty components and create guard
  records for them):
  <https://github.com/openbmc/docs/blob/9c79837a8a20dc8e131cc8f046d1ceb4a731391a/designs/guard-on-bmc.md>

There is an OpenCompute Fault Management Infrastructure proposal that also
recommends delivering error logs from the BMC:
<https://drive.google.com/file/d/1A9Qc7hB3THw0wiEK_dbXYj85_NOJWrb5/>

## Requirements

- The users of this solution are Redfish clients in data center software. The
  goal of the fault monitor is to enable rich error logging (OEM and CPU vendor
  specific) for data center tools to monitor servers, manage repairs, predict
  crashes, etc.
- The fault monitor must be able to handle receiving fault information that is
  polled periodically as well as fault information that may come in
  sporadically based on fault incidents (e.g. crash dumps).

- The fault monitor should allow for logging of fault information entries of a
  variety of sizes (on the order of bytes to megabytes). In general, more
  severe errors, which require more fault information to be collected, tend to
  occur less frequently, while less severe errors such as correctable errors
  require less logging but may happen more frequently.

- Fault information must be added to a Redfish LogService in a timely manner
  (within a few seconds of the original event) to be available to external data
  center monitoring software.

- The fault monitor must allow for custom overwrite rules for its log entries
  (e.g. on overflow, save first errors and more severe errors), or guarantee
  that enough space is available in its log such that all data from the most
  recent couple of hours is always kept intact. The log does not have to be
  stored persistently (though it can be).

## Proposed Design

A generic fault monitor will be created to collect fault information. First we
discuss a few example use cases:

- On CATERR, the Host Error Monitor requests a crash dump (this is an existing
  capability). The crash dump includes chipset registers but doesn’t include
  platform-specific system-level data. The fault monitor would therefore
  additionally collect system-level data such as clock, thermal, and power
  information. This information would be bundled, logged, and associated with
  the crash dump so that it could be post-processed by data center monitoring
  tools without having to join multiple data sources.

- The fault monitor would monitor link level retries and link retrainings of
  high speed serial links such as UPI links. This isn’t typically monitored by
  the host kernel at runtime and the host kernel isn’t able to log it during a
  crash. The fault monitor in the BMC could check link level retries and link
  retrainings during runtime by polling over PECI. If an MCERR or IERR
  occurred, the fault monitor could then add additional information such as
  high speed serial link statistics to error logs.

- In order to monitor memory out of band, a system could be configured to give
  the BMC exclusive access to memory error logging registers (to prevent the
  host kernel from being able to access and clear the registers before the BMC
  could collect the register data). For corrected memory errors, the fault
  monitor could log error registers either through polling or interrupts. Data
  center monitoring tools would use the logs to determine whether memory should
  be swapped or a machine should be removed from usage.

The fault monitor will not have its own dedicated OpenBMC repository, but will
consist of components incorporated into the existing repositories
host-error-monitor, bmcweb, and phosphor-debug-collector.

In the existing Host Error Monitor module, new monitors will be created to add
functionality needed for the fault monitor. For instance, based on the needs of
the OEM, the fault monitor will register to be notified of D-Bus signals of
interest in order to be alerted when fault events occur. The fault monitor will
also poll registers of interest and log their values to the fault log
(described in more detail below).
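For illustration only, the following is a minimal sketch of what such a new
monitor could look like, using sdbusplus and Boost.Asio (which
host-error-monitor already uses). The D-Bus path, interface name, poll
interval, and handler bodies are placeholders chosen for this sketch, not
existing or proposed interfaces.

```cpp
#include <boost/asio/io_context.hpp>
#include <boost/asio/steady_timer.hpp>
#include <sdbusplus/asio/connection.hpp>
#include <sdbusplus/bus/match.hpp>

#include <chrono>
#include <functional>
#include <iostream>
#include <memory>

// Placeholder D-Bus path/interface, used purely for illustration.
static constexpr const char* faultPath =
    "/xyz/openbmc_project/example/fault_source";
static constexpr const char* faultIface =
    "xyz.openbmc_project.Example.FaultSource";

int main()
{
    boost::asio::io_context io;
    auto conn = std::make_shared<sdbusplus::asio::connection>(io);

    // Monitor 1: react to fault-related D-Bus property changes.
    sdbusplus::bus::match_t faultMatch(
        *conn,
        sdbusplus::bus::match::rules::propertiesChanged(faultPath, faultIface),
        [](sdbusplus::message::message& msg) {
            // Placeholder handler: decode the signal and hand the data to the
            // fault log (e.g. via a method on the fault log D-Bus object).
            std::cerr << "Fault-related change on " << msg.get_path() << "\n";
        });

    // Monitor 2: periodically poll registers of interest (e.g. over PECI) and
    // log their values to the fault log.
    boost::asio::steady_timer pollTimer(io);
    std::function<void()> pollRegisters = [&]() {
        // Placeholder: read and record registers here.
        pollTimer.expires_after(std::chrono::seconds(10));
        pollTimer.async_wait([&](const boost::system::error_code& ec) {
            if (!ec)
            {
                pollRegisters();
            }
        });
    };
    pollRegisters();

    io.run();
    return 0;
}
```

A real monitor would live alongside the existing monitors in
host-error-monitor and reuse its shared io_context and D-Bus connection rather
than creating its own.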
In addition, the host will be able to write fault information to the fault log
(via a POST (Create) request to its corresponding Redfish log resource
collection). When the fault monitor becomes aware of a new fault occurrence
through any of these ways, it may add fault information to the fault log. The
fault monitor may also gather relevant sensor data (read via D-Bus from the
dbus-sensors services) and add it to the fault log, with a reference to the
original fault event information. The EventGroupId in a Redfish LogEntry could
potentially be used to associate multiple log entries related to the same fault
event.

The fault log for storing relevant fault information (and exposing it to
external data center monitoring software) will be a new Redfish LogService
(/redfish/v1/Systems/system/LogServices/FaultLog) with
`OverWritePolicy=Unknown`, in order to implement custom overwrite rules such as
prioritizing retaining first and/or more severe faults. The back-end
implementation of the fault log, including saving and managing log files, will
be added into the existing Phosphor Debug Collector repository with an
associated D-Bus object (e.g. xyz/openbmc_project/dump/faultlog) whose
interface will include methods for writing new data into the log, retrieving
data from the log, and clearing the log. The fault log will be implemented as a
new dump type in an existing Phosphor Debug Collector daemon (specifically the
one whose main() function is in dump_manager_main.cpp). The new fault log would
contain dump files that are collected in a variety of ways in a variety of
formats. A new fault log dump entry class (deriving from the "Entry" class in
dump_entry.hpp) would be defined with an additional "dump type" member variable
to identify the type of data that a fault log dump entry's corresponding dump
file contains.

bmcweb will be used as the associated Redfish webserver for external entities
to read and write the fault log. Functionality for handling a POST (Create)
request to a Redfish log resource collection will be added in bmcweb. When
delivering a Redfish fault log entry to a Redfish client, large-sized fault
information (e.g. crashdumps) can be specified as an attachment sub-resource
(AdditionalDataURI) instead of being inlined. Redfish events (EventService
schema) will be used to send external notifications, such as when the fault
monitor needs to notify external data center monitoring software of new fault
information being available. Redfish events may also be used to notify the host
kernel and/or BIOS of any repair actions that need to be triggered based on the
latest fault information.

## Alternatives Considered

We considered adding the fault logs into the main system event log
(/redfish/v1/Systems/system/LogServices/EventLog) or other logs already
existing in bmcweb (e.g. /redfish/v1/Systems/system/LogServices/Dump,
/redfish/v1/Managers/bmc/LogServices/Dump), but we would like to implement a
separate custom overwrite policy to ensure the most important information (such
as first errors and most severe errors) is retained for local analysis.
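To make the intended custom overwrite behavior concrete, here is a
self-contained sketch of one possible retention rule ("keep first and more
severe entries"). The `Severity`, `FaultEntry`, and `FaultLogStore` names are
illustrative only and do not exist in any repository; the real policy would
live in the Phosphor Debug Collector fault log back end and would likely budget
by bytes rather than entry count.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical severity scale; real code would map from the fault source.
enum class Severity { Correctable = 0, Uncorrectable = 1, Fatal = 2 };

struct FaultEntry
{
    uint64_t sequence;  // monotonically increasing arrival order
    Severity severity;
    std::string data;   // serialized fault information
};

// Minimal sketch of an overwrite policy: when the log is full, a new entry may
// only displace the newest of the lowest-severity stored entries, so that
// first-seen and more severe faults are retained.
class FaultLogStore
{
  public:
    explicit FaultLogStore(std::size_t capacity) : capacity_(capacity) {}

    // Returns true if the entry was stored.
    bool add(FaultEntry entry)
    {
        if (entries_.size() < capacity_)
        {
            entries_.push_back(std::move(entry));
            return true;
        }
        if (capacity_ == 0)
        {
            return false;
        }
        // Eviction candidate: lowest severity; among equals, the most
        // recently added, so earlier ("first error") entries survive.
        auto victim = std::min_element(
            entries_.begin(), entries_.end(),
            [](const FaultEntry& a, const FaultEntry& b) {
                return std::tie(a.severity, b.sequence) <
                       std::tie(b.severity, a.sequence);
            });
        // Overwrite only when the incoming entry is strictly more severe;
        // otherwise keep what we already have (first errors win ties).
        if (victim->severity < entry.severity)
        {
            *victim = std::move(entry);
            return true;
        }
        return false;
    }

    const std::vector<FaultEntry>& entries() const { return entries_; }

  private:
    std::size_t capacity_;
    std::vector<FaultEntry> entries_;
};
```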
## Impacts

There may be situations where external consumers of fault monitor logs (e.g.
data center monitoring tools) are running software that is newer or older than
the version matching the BMC software running on a machine. In such cases,
consumers can ignore any types of fault information provided by the fault
monitor that they are not prepared to handle.

Errors are expected to happen infrequently, or to be throttled, so we expect
little to no performance impact.

## Testing

Error injection mechanisms or simulations may be used to artificially create
error conditions that will be logged by the fault monitor module.

There is no significant impact expected with regard to CI testing, but we do
intend to add unit testing for the fault monitor.
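As a sketch of the kind of unit test intended, and assuming the retention
policy example above were factored into a hypothetical `faultlog_store.hpp`
header, a GoogleTest case (GoogleTest being commonly used across OpenBMC
repositories) might look like the following:

```cpp
#include "faultlog_store.hpp" // hypothetical header for the sketch above

#include <gtest/gtest.h>

// Build note: link against GoogleTest (e.g. gtest_main).

// Verify that, on overflow, lower-severity entries are overwritten and the
// earliest ("first error") entries are retained.
TEST(FaultLogStoreTest, OverflowKeepsFirstAndMoreSevereEntries)
{
    FaultLogStore store(2);

    EXPECT_TRUE(store.add({0, Severity::Correctable, "first correctable"}));
    EXPECT_TRUE(store.add({1, Severity::Correctable, "second correctable"}));

    // Another correctable error does not displace earlier ones.
    EXPECT_FALSE(store.add({2, Severity::Correctable, "third correctable"}));

    // A fatal error displaces the newest of the lowest-severity entries.
    EXPECT_TRUE(store.add({3, Severity::Fatal, "fatal"}));

    ASSERT_EQ(store.entries().size(), 2u);
    EXPECT_EQ(store.entries()[0].data, "first correctable");
    EXPECT_EQ(store.entries()[1].data, "fatal");
}
```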