# Hardware Fault Monitor

Author: Claire Weinan (cweinan@google.com, daylight22)

Other contributors: Heinz Boehmer Fiehn (heinzboehmer@google.com), Drew Walton
(acwalton@google.com)

Created: Aug 5, 2021

## Problem Description

The goal is to create a new hardware fault monitor which will provide a
framework for collecting various fault and sensor information and making it
available externally via Redfish for data center monitoring and management
purposes. The information logged would include a wide variety of chipset
registers and data from manageability hardware. In addition to collecting
information through BMC interfaces, the hardware fault monitor will also receive
information via Redfish from the associated host kernel (specifically for cases
in which the desired information cannot be collected directly by the BMC, for
example when accessing registers that are read and cleared by the host kernel).

Future expansion of the hardware fault monitor would include adding the means to
locally analyze fault and sensor information and then, based on specified
criteria, trigger repair actions in the host BIOS or kernel. In addition, the
hardware fault monitor could receive repair action requests via Redfish from
external data center monitoring software.

## Background and References

The following are a few related existing OpenBMC modules:

- Host Error Monitor logs CPU error information such as CATERR details and takes
  appropriate actions such as performing resets and collecting crashdumps:
  https://github.com/openbmc/host-error-monitor

- bmcweb implements a Redfish webserver for OpenBMC:
  https://github.com/openbmc/bmcweb. The Redfish LogService schema is available
  for logging purposes and the EventService schema is available for a Redfish
  server to send event notifications to clients.

- Phosphor Debug Collector (phosphor-debug-collector) collects various debug
  dumps and saves them into files:
  https://github.com/openbmc/phosphor-debug-collector

- Dbus-sensors reads and saves sensor values and makes them available to other
  modules via D-Bus: https://github.com/openbmc/dbus-sensors

- SEL logger logs to the IPMI and Redfish system event logs when certain events
  happen, such as sensor readings going beyond their thresholds:
  https://github.com/openbmc/phosphor-sel-logger

- FRU fault manager controls the blinking of LEDs when faults occur:
  https://github.com/openbmc/phosphor-led-manager/blob/master/fault-monitor/fru-fault-monitor.hpp

- Guard On BMC records and manages a list of faulty components for isolation.
  (Both the host and the BMC may identify faulty components and create guard
  records for them):
  https://github.com/openbmc/docs/blob/9c79837a8a20dc8e131cc8f046d1ceb4a731391a/designs/guard-on-bmc.md

There is an OpenCompute Fault Management Infrastructure proposal that also
recommends delivering error logs from the BMC:
https://drive.google.com/file/d/1A9Qc7hB3THw0wiEK_dbXYj85_NOJWrb5/

## Requirements

- The users of this solution are Redfish clients in data center software. The
  goal of the fault monitor is to enable rich error logging (OEM and CPU vendor
  specific) for data center tools to monitor servers, manage repairs, predict
  crashes, etc.

- The fault monitor must be able to handle receiving fault information that is
  polled periodically as well as fault information that may come in sporadically
  based on fault incidents (e.g. crash dumps).

- The fault monitor should allow for logging of a variety of sizes of fault
  information entries (on the order of bytes to megabytes). In general, more
  severe errors which require more fault information to be collected tend to
  occur less frequently, while less severe errors such as correctable errors
  require less logging but may happen more frequently.

- Fault information must be added to a Redfish LogService in a timely manner
  (within a few seconds of the original event) to be available to external data
  center monitoring software.

- The fault monitor must allow for custom overwrite rules for its log entries
  (e.g. on overflow, save first errors and more severe errors), or guarantee
  that enough space is available in its log such that all data from the most
  recent couple of hours is always kept intact. The log does not have to be
  stored persistently (though it can be).

## Proposed Design

A generic fault monitor will be created to collect fault information. First we
discuss a few example use cases:

- On CATERR, the Host Error Monitor requests a crash dump (this is an existing
  capability). The crash dump includes chipset registers but doesn’t include
  platform-specific system-level data. The fault monitor would therefore
  additionally collect system-level data such as clock, thermal, and power
  information. This information would be bundled, logged, and associated with
  the crash dump so that it could be post-processed by data center monitoring
  tools without having to join multiple data sources.

- The fault monitor would monitor link level retries and link retrainings of
  high speed serial links such as UPI links. This isn’t typically monitored by
  the host kernel at runtime and the host kernel isn’t able to log it during a
  crash. The fault monitor in the BMC could check link level retries and link
  retrainings during runtime by polling over PECI. If an MCERR or IERR occurred,
  the fault monitor could then add additional information such as high speed
  serial link statistics to error logs.

- In order to monitor memory out of band, a system could be configured to give
  the BMC exclusive access to memory error logging registers (to prevent the
  host kernel from being able to access and clear the registers before the BMC
  could collect the register data). For corrected memory errors, the fault
  monitor could log error registers either through polling or interrupts. Data
  center monitoring tools would use the logs to determine whether memory should
  be swapped or a machine should be removed from usage.

The fault monitor will not have its own dedicated OpenBMC repository, but will
consist of components incorporated into the existing repositories
host-error-monitor, bmcweb, and phosphor-debug-collector.

In the existing Host Error Monitor module, new monitors will be created to add
functionality needed for the fault monitor. For instance, based on the needs of
the OEM, the fault monitor will register to be notified of D-Bus signals of
interest in order to be alerted when fault events occur.
The fault monitor will also poll registers of interest and log their values to
the fault log (described in more detail below). In addition, the host will be
able to write fault information to the fault log (via a POST (Create) request to
its corresponding Redfish log resource collection). When the fault monitor
becomes aware of a new fault occurrence through any of these ways, it may add
fault information to the fault log. The fault monitor may also gather relevant
sensor data (read via D-Bus from the dbus-sensors services) and add it to the
fault log, with a reference to the original fault event information. The
`EventGroupId` property in a Redfish LogEntry could potentially be used to
associate multiple log entries related to the same fault event.

The fault log for storing relevant fault information (and exposing it to
external data center monitoring software) will be a new Redfish LogService
(/redfish/v1/Systems/system/LogServices/FaultLog) with
`OverwritePolicy=Unknown`, in order to implement custom overwrite rules such as
prioritizing retaining first and/or more severe faults. The back end
implementation of the fault log, including saving and managing log files, will
be added into the existing Phosphor Debug Collector repository with an
associated D-Bus object (e.g. /xyz/openbmc_project/dump/faultlog) whose
interface will include methods for writing new data into the log, retrieving
data from the log, and clearing the log. The fault log will be implemented as a
new dump type in an existing Phosphor Debug Collector daemon (specifically the
one whose main() function is in dump_manager_main.cpp). The new fault log would
contain dump files that are collected in a variety of ways in a variety of
formats. A new fault log dump entry class (deriving from the "Entry" class in
dump_entry.hpp) would be defined with an additional "dump type" member variable
to identify the type of data that a fault log dump entry's corresponding dump
file contains.

bmcweb will be used as the associated Redfish webserver for external entities to
read and write the fault log. Functionality for handling a POST (Create) request
to a Redfish log resource collection will be added in bmcweb. When delivering a
Redfish fault log entry to a Redfish client, large fault information (e.g.
crashdumps) can be specified as an attachment sub-resource (AdditionalDataURI)
instead of being inlined. Redfish events (EventService schema) will be used to
send external notifications, such as when the fault monitor needs to notify
external data center monitoring software of new fault information being
available.
Redfish events may also be used to notify the host kernel and/or BIOS of any
repair actions that need to be triggered based on the latest fault information.

## Alternatives Considered

We considered adding the fault logs into the main system event log
(/redfish/v1/Systems/system/LogServices/EventLog) or other logs already existing
in bmcweb (e.g. /redfish/v1/Systems/system/LogServices/Dump,
/redfish/v1/Managers/bmc/LogServices/Dump), but we would like to implement a
separate custom overwrite policy to ensure the most important information (such
as first errors and most severe errors) is retained for local analysis.

## Impacts

There may be situations where external consumers of fault monitor logs (e.g.
data center monitoring tools) are running software that is newer or older than
the version matching the BMC software running on a machine. In such cases,
consumers can ignore any types of fault information provided by the fault
monitor that they are not prepared to handle.

Errors are expected to happen infrequently, or to be throttled, so we expect
little to no performance impact.

## Testing

Error injection mechanisms or simulations may be used to artificially create
error conditions that will be logged by the fault monitor module.
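
As one possible form of simulation, a test helper could build the request bodies
that a host would POST to the fault log's entry collection, without touching any
hardware. The sketch below is illustrative only. The `/Entries` collection path,
the helper names `make_entry` and `simulate_fault`, and the choice of LogEntry
fields are assumptions and are not specified by this design. It shows a
simulated fault and its related sensor readings sharing one `EventGroupId` so
that the entries can be associated afterwards.

```python
import json

# Assumed collection path: the FaultLog LogService URI from this design plus
# the conventional Redfish "Entries" collection. Not confirmed by the design.
FAULT_LOG_ENTRIES = "/redfish/v1/Systems/system/LogServices/FaultLog/Entries"


def make_entry(message, severity, event_group_id):
    """Build a LogEntry-shaped POST body (hypothetical helper)."""
    if severity not in ("OK", "Warning", "Critical"):
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "EntryType": "Oem",
        "Severity": severity,
        "Message": message,
        "EventGroupId": event_group_id,
    }


def simulate_fault(event_group_id, fault_message, sensor_readings):
    """Return (uri, json_body) pairs describing one simulated fault event."""
    # The fault itself comes first, followed by one entry per sensor reading.
    entries = [make_entry(fault_message, "Critical", event_group_id)]
    for name, value in sorted(sensor_readings.items()):
        entries.append(make_entry(f"{name}={value}", "OK", event_group_id))
    return [(FAULT_LOG_ENTRIES, json.dumps(e, sort_keys=True)) for e in entries]


# Made-up sensor names and values, used only to exercise the helper.
requests_to_send = simulate_fault(
    event_group_id=7,
    fault_message="Simulated CATERR",
    sensor_readings={"cpu0_temp_c": 81.5, "p12v_volts": 11.9},
)
for uri, body in requests_to_send:
    print("POST", uri, body)
```

The helper only constructs the requests. Sending them to a BMC under test, and
checking that the entries appear in the fault log within the required few
seconds, would be left to the surrounding test harness.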

There is no significant impact expected with regard to CI testing, but we do
intend to add unit testing for the fault monitor.
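
A unit test for the custom overwrite requirement could exercise a small policy
object like the one sketched here. This is only an illustration under stated
assumptions: the class name `BoundedFaultLog`, the severity ranking, and the
tie-breaking rule (evict the newest of the least severe entries) are
hypothetical choices, and the real fault log back end would live in Phosphor
Debug Collector rather than in Python.

```python
from dataclasses import dataclass, field

# Assumed ranking; higher values are retained in preference to lower ones.
SEVERITY_RANK = {"OK": 0, "Warning": 1, "Critical": 2}


@dataclass
class BoundedFaultLog:
    """Hypothetical in-memory log that keeps first and more severe faults."""

    capacity: int
    entries: list = field(default_factory=list)  # (sequence, severity, message)
    _next_seq: int = 0

    def add(self, severity, message):
        """Add an entry; return True if it is retained, False if dropped."""
        candidate = (self._next_seq, severity, message)
        self._next_seq += 1
        self.entries.append(candidate)
        if len(self.entries) <= self.capacity:
            return True
        # On overflow, evict the least severe entry. Among equally severe
        # entries, evict the newest so that first occurrences are kept.
        victim = min(self.entries, key=lambda e: (SEVERITY_RANK[e[1]], -e[0]))
        self.entries.remove(victim)
        return victim is not candidate


log = BoundedFaultLog(capacity=2)
for severity, message in [("Warning", "w1"), ("Warning", "w2"),
                          ("Warning", "w3"), ("Critical", "c1")]:
    log.add(severity, message)
print([m for _, _, m in log.entries])  # ['w1', 'c1']
```

In this run the third warning is dropped because two earlier warnings already
fill the log, while the later critical entry displaces the newer of the two
warnings. A time-based guarantee (keeping the most recent couple of hours
intact) would need a different rule and is not covered by this sketch.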