# Hardware Fault Monitor

Author: Claire Weinan (cweinan@google.com, daylight22)

Other contributors: Heinz Boehmer Fiehn (heinzboehmer@google.com), Drew Walton
(acwalton@google.com)

Created: Aug 5, 2021

## Problem Description

The goal is to create a new hardware fault monitor that provides a framework for
collecting various fault and sensor information and making it available
externally via Redfish for data center monitoring and management purposes. The
information logged would include a wide variety of chipset registers and data
from manageability hardware. In addition to collecting information through BMC
interfaces, the hardware fault monitor will also receive information via Redfish
from the associated host kernel (specifically for cases in which the desired
information cannot be collected directly by the BMC, for example when accessing
registers that are read and cleared by the host kernel).

Future expansion of the hardware fault monitor would include adding the means to
locally analyze fault and sensor information and then, based on specified
criteria, trigger repair actions in the host BIOS or kernel. In addition, the
hardware fault monitor could receive repair action requests via Redfish from
external data center monitoring software.

## Background and References

The following are a few related existing OpenBMC modules:

- Host Error Monitor logs CPU error information such as CATERR details and takes
  appropriate actions such as performing resets and collecting crashdumps:
  https://github.com/openbmc/host-error-monitor

- bmcweb implements a Redfish webserver for OpenBMC:
  https://github.com/openbmc/bmcweb. The Redfish LogService schema is available
  for logging purposes and the EventService schema is available for a Redfish
  server to send event notifications to clients.

- Phosphor Debug Collector (phosphor-debug-collector) collects various debug
  dumps and saves them into files:
  https://github.com/openbmc/phosphor-debug-collector

- dbus-sensors reads and saves sensor values and makes them available to other
  modules via D-Bus: https://github.com/openbmc/dbus-sensors

- SEL logger logs to the IPMI and Redfish system event logs when certain events
  happen, such as sensor readings going beyond their thresholds:
  https://github.com/openbmc/phosphor-sel-logger

- FRU fault manager controls the blinking of LEDs when faults occur:
  https://github.com/openbmc/phosphor-led-manager/blob/master/fault-monitor/fru-fault-monitor.hpp

- Guard On BMC records and manages a list of faulty components for isolation.
  (Both the host and the BMC may identify faulty components and create guard
  records for them):
  https://github.com/openbmc/docs/blob/9c79837a8a20dc8e131cc8f046d1ceb4a731391a/designs/guard-on-bmc.md

There is an OpenCompute Fault Management Infrastructure proposal that also
recommends delivering error logs from the BMC:
https://drive.google.com/file/d/1A9Qc7hB3THw0wiEK_dbXYj85_NOJWrb5/

## Requirements

- The users of this solution are Redfish clients in data center software. The
  goal of the fault monitor is to enable rich error logging (OEM and CPU vendor
  specific) for data center tools to monitor servers, manage repairs, predict
  crashes, etc.

- The fault monitor must be able to handle receiving fault information that is
  polled periodically as well as fault information that may come in sporadically
  based on fault incidents (e.g. crash dumps).

- The fault monitor should allow for logging of a variety of sizes of fault
  information entries (on the order of bytes to megabytes). In general, more
  severe errors which require more fault information to be collected tend to
  occur less frequently, while less severe errors such as correctable errors
  require less logging but may happen more frequently.

- Fault information must be added to a Redfish LogService in a timely manner
  (within a few seconds of the original event) to be available to external data
  center monitoring software.

- The fault monitor must allow for custom overwrite rules for its log entries
  (e.g. on overflow, save first errors and more severe errors), or guarantee
  that enough space is available in its log such that all data from the most
  recent couple of hours is always kept intact. The log does not have to be
  stored persistently (though it can be).

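As a rough illustration of the custom overwrite rule in the last requirement,
the following Python sketch keeps a size-bounded log and, on overflow, evicts
entries that are neither the first from their source nor severe. The class and
field names (`FaultEntry`, `FaultLog`, `source`, `severity`) and the numeric
severity scale are assumptions made for this example only; they are not part of
the design.

```python
from dataclasses import dataclass, field
from itertools import count

_seq = count()  # monotonically increasing arrival order


@dataclass
class FaultEntry:
    source: str    # illustrative label, e.g. "memory" or "upi-link"
    severity: int  # assumed scale: higher means more severe
    size: int      # payload size in bytes
    seq: int = field(default_factory=lambda: next(_seq))


class FaultLog:
    """Bounded log that evicts the least important entry on overflow."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.entries = []
        self._seen_sources = set()
        self._first = set()  # seq numbers of the first entry per source

    def used(self):
        return sum(e.size for e in self.entries)

    def _eviction_key(self, entry):
        # Evict non-first entries before first ones, then the lowest
        # severity, then the oldest.
        return (entry.seq in self._first, entry.severity, entry.seq)

    def add(self, entry):
        """Add an entry; return True if it is still in the log afterwards."""
        if entry.size > self.capacity_bytes:
            return False
        if entry.source not in self._seen_sources:
            self._seen_sources.add(entry.source)
            self._first.add(entry.seq)
        self.entries.append(entry)
        while self.used() > self.capacity_bytes:
            victim = min(self.entries, key=self._eviction_key)
            self.entries.remove(victim)
        return entry in self.entries
```
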
## Proposed Design

A generic fault monitor will be created to collect fault information. First we
discuss a few example use cases:

- On CATERR, the Host Error Monitor requests a crash dump (this is an existing
  capability). The crash dump includes chipset registers but doesn’t include
  platform-specific system-level data. The fault monitor would therefore
  additionally collect system-level data such as clock, thermal, and power
  information. This information would be bundled, logged, and associated with
  the crash dump so that it could be post-processed by data center monitoring
  tools without having to join multiple data sources.

- The fault monitor would monitor link level retries and link retrainings of
  high speed serial links such as UPI links. This isn’t typically monitored by
  the host kernel at runtime and the host kernel isn’t able to log it during a
  crash. The fault monitor in the BMC could check link level retries and link
  retrainings during runtime by polling over PECI. If an MCERR or IERR occurred,
  the fault monitor could then add additional information such as high speed
  serial link statistics to error logs.

- In order to monitor memory out of band, a system could be configured to give
  the BMC exclusive access to memory error logging registers (to prevent the
  host kernel from being able to access and clear the registers before the BMC
  could collect the register data). For corrected memory errors, the fault
  monitor could log error registers either through polling or interrupts. Data
  center monitoring tools would use the logs to determine whether memory should
  be swapped or a machine should be removed from usage.

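The polling use cases above amount to comparing successive counter readings and
logging only the increases. A minimal Python sketch of that step follows. The
`read_counters` callable is a hypothetical stand-in for whatever actually reads
the hardware (for example over PECI), and the counter names are invented for the
example. The first poll only establishes a baseline, and a counter that goes
backwards (for example after a reset) is treated as a new baseline rather than
as a fault.

```python
from typing import Callable, Dict, List, Tuple


def poll_once(read_counters: Callable[[], Dict[str, int]],
              last: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (counter name, increase) for each counter that grew.

    `last` holds the previous readings and is updated in place.
    """
    current = read_counters()
    changes = []
    for name, value in current.items():
        previous = last.get(name)
        if previous is not None and value > previous:
            changes.append((name, value - previous))
        last[name] = value
    return changes
```
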
The fault monitor will not have its own dedicated OpenBMC repository, but will
consist of components incorporated into the existing repositories
host-error-monitor, bmcweb, and phosphor-debug-collector.

In the existing Host Error Monitor module, new monitors will be created to add
functionality needed for the fault monitor. For instance, based on the needs of
the OEM, the fault monitor will register to be notified of D-Bus signals of
interest in order to be alerted when fault events occur. The fault monitor will
also poll registers of interest and log their values to the fault log (described
in more detail later in this section). In addition, the host will be able to
write fault information to the fault log (via a POST (Create) request to its
corresponding Redfish log resource collection). When the fault monitor becomes
aware of a new fault occurrence through any of these ways, it may add fault
information to the fault log. The fault monitor may also gather relevant sensor
data (read via D-Bus from the dbus-sensors services) and add it to the fault
log, with a reference to the original fault event information. The EventGroupId
in a Redfish LogEntry could potentially be used to associate multiple log
entries related to the same fault event.

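To make the EventGroupId idea concrete, the sketch below builds the JSON bodies
that could be POSTed to the fault log's entry collection: one entry for the
fault itself and one per related sensor reading, all sharing the same group
identifier. This is an illustration under assumptions, not the planned payload
format: the `OemRecordFormat` value, the message text, and the function name are
hypothetical.

```python
import json


def build_log_entries(group_id, fault_message, sensor_readings):
    """Return LogEntry-style dicts that share one EventGroupId."""
    entries = [{
        "EntryType": "Oem",
        "OemRecordFormat": "ExampleFault",  # hypothetical format name
        "Severity": "Critical",
        "Message": fault_message,
        "EventGroupId": group_id,
    }]
    for name, value in sorted(sensor_readings.items()):
        entries.append({
            "EntryType": "Oem",
            "OemRecordFormat": "ExampleSensorSnapshot",  # hypothetical
            "Severity": "OK",
            "Message": f"{name}={value}",
            "EventGroupId": group_id,
        })
    return entries


if __name__ == "__main__":
    for body in build_log_entries(7, "CATERR asserted", {"CPU0_Temp": 81.5}):
        print(json.dumps(body))
```
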
The fault log for storing relevant fault information (and exposing it to
external data center monitoring software) will be a new Redfish LogService
(/redfish/v1/Systems/system/LogServices/FaultLog) with
`OverWritePolicy=Unknown`, in order to implement custom overwrite rules such as
prioritizing retaining first and/or more severe faults. The back end
implementation of the fault log, including saving and managing log files, will
be added into the existing Phosphor Debug Collector repository with an
associated D-Bus object (e.g. /xyz/openbmc_project/dump/faultlog) whose
interface will include methods for writing new data into the log, retrieving
data from the log, and clearing the log. The fault log will be implemented as a
new dump type in an existing Phosphor Debug Collector daemon (specifically the
one whose main() function is in dump_manager_main.cpp). The new fault log would
contain dump files that are collected in a variety of ways in a variety of
formats. A new fault log dump entry class (deriving from the "Entry" class in
dump_entry.hpp) would be defined with an additional "dump type" member variable
to identify the type of data that a fault log dump entry's corresponding dump
file contains.

bmcweb will be used as the associated Redfish webserver for external entities to
read and write the fault log. Functionality for handling a POST (Create) request
to a Redfish log resource collection will be added in bmcweb. When delivering a
Redfish fault log entry to a Redfish client, large-sized fault information (e.g.
crashdumps) can be specified as an attachment sub-resource (AdditionalDataURI)
instead of being inlined. Redfish events (EventService schema) will be used to
send external notifications, such as when the fault monitor needs to notify
external data center monitoring software of new fault information being
available. Redfish events may also be used to notify the host kernel and/or BIOS
of any repair actions that need to be triggered based on the latest fault
information.

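The choice between inlining fault data and exposing it as an attachment can be
sketched as a size check when the log entry is rendered. In the Python sketch
below, the 4096-byte cutoff, the `/attachment` URI suffix, and the `Oem`
sub-object used for inline data are all assumptions chosen for illustration; the
design does not specify them.

```python
import base64

# Assumed cutoff between inline and attached payloads (not from the design).
INLINE_LIMIT_BYTES = 4096
ENTRIES_URI = "/redfish/v1/Systems/system/LogServices/FaultLog/Entries"


def to_log_entry(entry_id, payload):
    """Render a LogEntry-style dict for a fault payload given as bytes."""
    uri = f"{ENTRIES_URI}/{entry_id}"
    entry = {"@odata.id": uri, "Id": entry_id, "EntryType": "Oem"}
    if len(payload) > INLINE_LIMIT_BYTES:
        # Large payloads: point the client at a separate sub-resource.
        entry["AdditionalDataURI"] = f"{uri}/attachment"
        entry["AdditionalDataSizeBytes"] = len(payload)
    else:
        # Small payloads: inline as base64 under a hypothetical OEM key.
        encoded = base64.b64encode(payload).decode("ascii")
        entry["Oem"] = {"Example": {"Data": encoded}}
    return entry
```
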
## Alternatives Considered

We considered adding the fault logs into the main system event log
(/redfish/v1/Systems/system/LogServices/EventLog) or other logs already existing
in bmcweb (e.g. /redfish/v1/Systems/system/LogServices/Dump,
/redfish/v1/Managers/bmc/LogServices/Dump), but we would like to implement a
separate custom overwrite policy to ensure the most important information (such
as first errors and most severe errors) is retained for local analysis.

## Impacts

There may be situations where external consumers of fault monitor logs (e.g.
data center monitoring tools) are running software that is newer or older than
the version matching the BMC software running on a machine. In such cases,
consumers can ignore any types of fault information provided by the fault
monitor that they are not prepared to handle.

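On the consumer side, tolerating version skew can be as simple as dispatching on
a type field and skipping anything unrecognized. The Python sketch below assumes
that entries are dicts and that the fault type is carried in `OemRecordFormat`;
both are illustrative choices, not something the design fixes.

```python
def process_entries(entries, handlers):
    """Run the matching handler per entry; collect entries with no handler.

    `handlers` maps a fault type string to a callable taking the entry.
    """
    handled, skipped = [], []
    for entry in entries:
        handler = handlers.get(entry.get("OemRecordFormat"))
        if handler is None:
            # Unknown or missing type: ignore it instead of failing.
            skipped.append(entry)
            continue
        handled.append(handler(entry))
    return handled, skipped
```
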
Errors are expected to happen infrequently, or to be throttled, so we expect
little to no performance impact.

## Testing

Error injection mechanisms or simulations may be used to artificially create
error conditions that will be logged by the fault monitor module.

There is no significant impact expected with regard to CI testing, but we do
intend to add unit testing for the fault monitor.