# Hardware Fault Monitor

Author:
  Claire Weinan (cweinan@google.com, daylight22)

Primary assignee:
  Claire Weinan (cweinan@google.com, daylight22),
  Heinz Boehmer Fiehn (heinzboehmer@google.com)

Other contributors:
  Drew Walton (acwalton@google.com)

Created:
  Aug 5, 2021

## Problem Description
The goal is to create a new hardware fault monitor which will provide a
framework for collecting various fault and sensor information and making it
available externally via Redfish for data center monitoring and management
purposes. The information logged would include a wide variety of chipset
registers and data from manageability hardware. In addition to collecting
information through BMC interfaces, the hardware fault monitor will also
receive information via Redfish from the associated host kernel (specifically
for cases in which the desired information cannot be collected directly by the
BMC, for example when accessing registers that are read and cleared by the
host kernel).

Future expansion of the hardware fault monitor would include adding the means
to locally analyze fault and sensor information and then, based on specified
criteria, trigger repair actions in the host BIOS or kernel. In addition, the
hardware fault monitor could receive repair action requests via Redfish from
external data center monitoring software.

## Background and References
The following are a few related existing OpenBMC modules:

- Host Error Monitor logs CPU error information such as CATERR details and
  takes appropriate actions such as performing resets and collecting
  crashdumps: https://github.com/openbmc/host-error-monitor

- bmcweb implements a Redfish webserver for OpenBMC:
  https://github.com/openbmc/bmcweb. The Redfish LogService schema is
  available for logging purposes, and the EventService schema is available
  for a Redfish server to send event notifications to clients.

- Phosphor Debug Collector (phosphor-debug-collector) collects various debug
  dumps and saves them into files:
  https://github.com/openbmc/phosphor-debug-collector

- dbus-sensors reads and saves sensor values and makes them available to
  other modules via D-Bus: https://github.com/openbmc/dbus-sensors

- SEL logger logs to the IPMI and Redfish system event logs when certain
  events happen, such as sensor readings going beyond their thresholds:
  https://github.com/openbmc/phosphor-sel-logger

- FRU fault manager controls the blinking of LEDs when faults occur:
  https://github.com/openbmc/phosphor-led-manager/blob/master/fault-monitor/fru-fault-monitor.hpp

- Guard On BMC records and manages a list of faulty components for isolation
  (both the host and the BMC may identify faulty components and create guard
  records for them):
  https://github.com/openbmc/docs/blob/9c79837a8a20dc8e131cc8f046d1ceb4a731391a/designs/guard-on-bmc.md

There is an OpenCompute Fault Management Infrastructure proposal that also
recommends delivering error logs from the BMC:
https://drive.google.com/file/d/1A9Qc7hB3THw0wiEK_dbXYj85_NOJWrb5/

## Requirements
- The users of this solution are Redfish clients in data center software. The
  goal of the fault monitor is to enable rich error logging (OEM- and
  CPU-vendor-specific) for data center tools to monitor servers, manage
  repairs, predict crashes, etc.
- The fault monitor must be able to handle receiving fault information that
  is polled periodically as well as fault information that may come in
  sporadically based on fault incidents (e.g. crash dumps).

- The fault monitor should allow for logging of fault information entries of
  a variety of sizes (on the order of bytes to megabytes). In general, more
  severe errors, which require more fault information to be collected, tend
  to occur less frequently, while less severe errors such as correctable
  errors require less logging but may happen more frequently.

- Fault information must be added to a Redfish LogService in a timely manner
  (within a few seconds of the original event) to be available to external
  data center monitoring software.

- The fault monitor must allow for custom overwrite rules for its log entries
  (e.g. on overflow, save first errors and more severe errors), or guarantee
  that enough space is available in its log such that all data from the most
  recent couple of hours is always kept intact. The log does not have to be
  stored persistently (though it can be).

## Proposed Design
A generic fault monitor will be created to collect fault information. First,
we discuss a few example use cases:

- On CATERR, the Host Error Monitor requests a crash dump (this is an
  existing capability). The crash dump includes chipset registers but doesn’t
  include platform-specific system-level data. The fault monitor would
  therefore additionally collect system-level data such as clock, thermal,
  and power information. This information would be bundled, logged, and
  associated with the crash dump so that it could be post-processed by data
  center monitoring tools without having to join multiple data sources.

- The fault monitor would monitor link-level retries and link retrainings of
  high-speed serial links such as UPI links. These events aren’t typically
  monitored by the host kernel at runtime, and the host kernel isn’t able to
  log them during a crash. The fault monitor in the BMC could check link-level
  retries and link retrainings during runtime by polling over PECI. If an
  MCERR or IERR occurred, the fault monitor could then add additional
  information such as high-speed serial link statistics to error logs.

- In order to monitor memory out of band, a system could be configured to
  give the BMC exclusive access to memory error logging registers (to prevent
  the host kernel from being able to access and clear the registers before
  the BMC could collect the register data). For corrected memory errors, the
  fault monitor could log error registers either through polling or
  interrupts. Data center monitoring tools would use the logs to determine
  whether memory should be swapped or a machine should be removed from usage.

The fault monitor will not have its own dedicated OpenBMC repository, but
will consist of components incorporated into the existing repositories
host-error-monitor, bmcweb, and phosphor-debug-collector.

In the existing Host Error Monitor module, new monitors will be created to
add functionality needed for the fault monitor. For instance, based on the
needs of the OEM, the fault monitor will register to be notified of D-Bus
signals of interest in order to be alerted when fault events occur (see the
sketch below). The fault monitor will also poll registers of interest and log
their values to the fault log (described in more detail later).
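
For illustration, here is a minimal sketch of how such a monitor might
register a D-Bus signal match, assuming the sdbusplus and Boost.Asio
facilities already used by host-error-monitor. The interface name in the
match rule and the handler body are placeholders, not part of the design:

```cpp
// Hypothetical new monitor for host-error-monitor: register a D-Bus signal
// match so the fault monitor is alerted when a fault event occurs.
#include <boost/asio/io_context.hpp>
#include <sdbusplus/asio/connection.hpp>
#include <sdbusplus/bus/match.hpp>

#include <memory>

int main()
{
    boost::asio::io_context io;
    auto conn = std::make_shared<sdbusplus::asio::connection>(io);

    // Placeholder match rule: be notified whenever an example fault event
    // signal is emitted on the bus.
    sdbusplus::bus::match::match faultEventMatch(
        static_cast<sdbusplus::bus::bus&>(*conn),
        "type='signal',interface='xyz.openbmc_project.Example.FaultEvent'",
        [](sdbusplus::message::message&) {
            // On a fault event: read the signal payload, gather any related
            // sensor data over D-Bus, and forward everything to the fault log
            // (e.g. via a method on the fault log dump manager).
        });

    io.run();
    return 0;
}
```

A monitor that polls registers of interest (e.g. over PECI) could follow the
same structure, driven by a periodic timer on the same io_context instead of
a signal match.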

In addition, the host will be able to write fault information to the fault
log (via a POST (Create) request to its corresponding Redfish log resource
collection). When the fault monitor becomes aware of a new fault occurrence
through any of these paths, it may add fault information to the fault log.
The fault monitor may also gather relevant sensor data (read via D-Bus from
the dbus-sensors services) and add it to the fault log, with a reference to
the original fault event information. The EventGroupId in a Redfish LogEntry
could potentially be used to associate multiple log entries related to the
same fault event.

The fault log for storing relevant fault information (and exposing it to
external data center monitoring software) will be a new Redfish LogService
(/redfish/v1/Systems/system/LogServices/FaultLog) with
`OverwritePolicy=unknown`, in order to implement custom overwrite rules such
as prioritizing retaining first and/or more severe faults. The back-end
implementation of the fault log, including saving and managing log files,
will be added into the existing Phosphor Debug Collector repository, with an
associated D-Bus object (e.g. xyz/openbmc_project/dump/faultlog) whose
interface will include methods for writing new data into the log, retrieving
data from the log, and clearing the log. The fault log will be implemented as
a new dump type in an existing Phosphor Debug Collector daemon (specifically
the one whose main() function is in dump_manager_main.cpp). The new fault log
would contain dump files that are collected in a variety of ways and in a
variety of formats. A new fault log dump entry class (deriving from the
"Entry" class in dump_entry.hpp) would be defined with an additional "dump
type" member variable to identify the type of data that a fault log dump
entry's corresponding dump file contains.

bmcweb will be used as the associated Redfish webserver for external entities
to read and write the fault log. Functionality for handling a POST (Create)
request to a Redfish log resource collection will be added in bmcweb (an
illustrative request body is sketched below). When delivering a Redfish fault
log entry to a Redfish client, large fault information (e.g. crashdumps) can
be specified as an attachment sub-resource (AdditionalDataURI) instead of
being inlined. Redfish events (EventService schema) will be used to send
external notifications, such as when the fault monitor needs to notify
external data center monitoring software of new fault information being
available. Redfish events may also be used to notify the host kernel and/or
BIOS of any repair actions that need to be triggered based on the latest
fault information.
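
To make the host-to-fault-log flow concrete, the following sketch (using
nlohmann::json, which bmcweb already uses) shows the kind of body the host
might send in a POST (Create) request to the fault log's entry collection.
The property choices, the OEM payload layout, and the EventGroupId value are
illustrative assumptions, not a finalized schema:

```cpp
// Illustrative only: build a candidate request body for
// POST /redfish/v1/Systems/system/LogServices/FaultLog/Entries
#include <nlohmann/json.hpp>

#include <iostream>

int main()
{
    nlohmann::json entry;
    entry["EntryType"] = "Oem";
    entry["Severity"] = "Critical";
    entry["Message"] = "Uncorrectable memory error reported by the host kernel";
    // Group this entry with other entries collected for the same fault event.
    entry["EventGroupId"] = 17;
    // Hypothetical OEM payload carrying the raw fault data; the exact layout
    // would be defined during implementation.
    entry["Oem"]["OpenBmc"]["FaultData"] = "base64-encoded register dump";

    // The serialized JSON would form the HTTP request body.
    std::cout << entry.dump(2) << std::endl;
    return 0;
}
```

Large data such as a full crashdump would not be inlined in an entry like
this; as described above, it would instead be exposed to Redfish clients
through an AdditionalDataURI attachment.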

## Alternatives Considered
We considered adding the fault logs into the main system event log
(/redfish/v1/Systems/system/LogServices/EventLog) or other logs already
existing in bmcweb (e.g. /redfish/v1/Systems/system/LogServices/Dump and
/redfish/v1/Managers/bmc/LogServices/Dump), but we would like to implement a
separate custom overwrite policy to ensure that the most important
information (such as first errors and the most severe errors) is retained for
local analysis.

## Impacts
There may be situations where external consumers of fault monitor logs (e.g.
data center monitoring tools) are running software that is newer or older
than the version matching the BMC software running on a machine. In such
cases, consumers can ignore any types of fault information provided by the
fault monitor that they are not prepared to handle.

Errors are expected to happen infrequently, or to be throttled, so we expect
little to no performance impact.

## Testing
Error injection mechanisms or simulations may be used to artificially create
error conditions that will be logged by the fault monitor module.

There is no significant impact expected with regard to CI testing, but we do
intend to add unit testing for the fault monitor (for example, covering the
custom overwrite rules).
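
As a rough illustration of the kind of unit test that could be added, the
following self-contained sketch exercises a hypothetical overwrite policy of
the sort described in the requirements (keep the first entry and prefer
retaining more severe entries once the log is full). The FaultLog class here
is a stand-in for illustration only, not the actual phosphor-debug-collector
implementation:

```cpp
// Hypothetical overwrite-policy unit test (GoogleTest; link with gtest_main).
// The FaultLog type is a stand-in defined inline so the sketch is
// self-contained.
#include <gtest/gtest.h>

#include <algorithm>
#include <cstddef>
#include <vector>

struct Entry
{
    unsigned id;       // creation order
    unsigned severity; // higher value = more severe
};

class FaultLog
{
  public:
    explicit FaultLog(std::size_t capacity) : capacity(capacity) {}

    void add(Entry e)
    {
        if (entries.size() < capacity)
        {
            entries.push_back(e);
            return;
        }
        // Overwrite rule (assumes capacity >= 2): evict the least severe
        // entry other than the very first one; drop the new entry instead if
        // nothing retained is less severe than it.
        auto victim = std::min_element(
            entries.begin() + 1, entries.end(),
            [](const Entry& a, const Entry& b) {
                return a.severity < b.severity;
            });
        if (victim->severity < e.severity)
        {
            *victim = e;
        }
    }

    const std::vector<Entry>& getEntries() const
    {
        return entries;
    }

  private:
    std::size_t capacity;
    std::vector<Entry> entries;
};

TEST(FaultLogOverwrite, KeepsFirstAndMostSevereEntries)
{
    FaultLog log(2);
    log.add({1, 1}); // first error: always retained
    log.add({2, 1});
    log.add({3, 5}); // more severe: should replace entry 2

    ASSERT_EQ(log.getEntries().size(), 2u);
    EXPECT_EQ(log.getEntries()[0].id, 1u);
    EXPECT_EQ(log.getEntries()[1].id, 3u);
}
```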