# Hardware Fault Monitor

Author: Claire Weinan (cweinan@google.com), daylight22

Other contributors: Heinz Boehmer Fiehn (heinzboehmer@google.com), Drew Walton
(acwalton@google.com)

Created: Aug 5, 2021

## Problem Description

The goal is to create a new hardware fault monitor which will provide a
framework for collecting various fault and sensor information and making it
available externally via Redfish for data center monitoring and management
purposes. The information logged would include a wide variety of chipset
registers and data from manageability hardware. In addition to collecting
information through BMC interfaces, the hardware fault monitor will also receive
information via Redfish from the associated host kernel (specifically for cases
in which the desired information cannot be collected directly by the BMC, for
example when accessing registers that are read and cleared by the host kernel).

Future expansion of the hardware fault monitor would include adding the means to
locally analyze fault and sensor information and then, based on specified
criteria, trigger repair actions in the host BIOS or kernel. In addition, the
hardware fault monitor could receive repair action requests via Redfish from
external data center monitoring software.

## Background and References

The following are a few related existing OpenBMC modules:

- Host Error Monitor logs CPU error information such as CATERR details and takes
  appropriate actions such as performing resets and collecting crashdumps:
  https://github.com/openbmc/host-error-monitor

- bmcweb implements a Redfish webserver for OpenBMC:
  https://github.com/openbmc/bmcweb. The Redfish LogService schema is available
  for logging purposes and the EventService schema is available for a Redfish
  server to send event notifications to clients.

- Phosphor Debug Collector (phosphor-debug-collector) collects various debug
  dumps and saves them into files:
  https://github.com/openbmc/phosphor-debug-collector

- Dbus-sensors reads and saves sensor values and makes them available to other
  modules via D-Bus: https://github.com/openbmc/dbus-sensors

- SEL logger logs to the IPMI and Redfish system event logs when certain events
  happen, such as sensor readings going beyond their thresholds:
  https://github.com/openbmc/phosphor-sel-logger

- FRU fault manager controls the blinking of LEDs when faults occur:
  https://github.com/openbmc/phosphor-led-manager/blob/master/fault-monitor/fru-fault-monitor.hpp

- Guard On BMC records and manages a list of faulty components for isolation.
  (Both the host and the BMC may identify faulty components and create guard
  records for them):
  https://github.com/openbmc/docs/blob/9c79837a8a20dc8e131cc8f046d1ceb4a731391a/designs/guard-on-bmc.md

There is an OpenCompute Fault Management Infrastructure proposal that also
recommends delivering error logs from the BMC:
https://drive.google.com/file/d/1A9Qc7hB3THw0wiEK_dbXYj85_NOJWrb5/

## Requirements

- The users of this solution are Redfish clients in data center software. The
  goal of the fault monitor is to enable rich error logging (OEM and CPU vendor
  specific) for data center tools to monitor servers, manage repairs, predict
  crashes, etc.

- The fault monitor must be able to handle receiving fault information that is
  polled periodically as well as fault information that may come in sporadically
  based on fault incidents (e.g. crash dumps).

- The fault monitor should allow for logging of a variety of sizes of fault
  information entries (on the order of bytes to megabytes). In general, more
  severe errors which require more fault information to be collected tend to
  occur less frequently, while less severe errors such as correctable errors
  require less logging but may happen more frequently.

- Fault information must be added to a Redfish LogService in a timely manner
  (within a few seconds of the original event) to be available to external data
  center monitoring software.

- The fault monitor must allow for custom overwrite rules for its log entries
  (e.g. on overflow, save first errors and more severe errors), or guarantee
  that enough space is available in its log such that all data from the most
  recent couple of hours is always kept intact. The log does not have to be
  stored persistently (though it can be).

## Proposed Design

A generic fault monitor will be created to collect fault information. First we
discuss a few example use cases:

- On CATERR, the Host Error Monitor requests a crash dump (this is an existing
  capability). The crash dump includes chipset registers but doesn’t include
  platform-specific system-level data. The fault monitor would therefore
  additionally collect system-level data such as clock, thermal, and power
  information. This information would be bundled, logged, and associated with
  the crash dump so that it could be post-processed by data center monitoring
  tools without having to join multiple data sources.

- The fault monitor would monitor link level retries and link retrainings of
  high speed serial links such as UPI links. These aren’t typically monitored by
  the host kernel at runtime, and the host kernel isn’t able to log them during
  a crash. The fault monitor in the BMC could check link level retries and link
  retrainings at runtime by polling over PECI (see the polling sketch after this
  list). If an MCERR or IERR occurred, the fault monitor could then add
  additional information such as high speed serial link statistics to error
  logs.

- In order to monitor memory out of band, a system could be configured to give
  the BMC exclusive access to memory error logging registers (to prevent the
  host kernel from being able to access and clear the registers before the BMC
  could collect the register data). For corrected memory errors, the fault
  monitor could log error registers either through polling or interrupts. Data
  center monitoring tools would use the logs to determine whether memory should
  be swapped or a machine should be removed from usage.

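As a rough illustration of the polling in the second use case above, the
following is a minimal sketch of how a periodic check could be structured with
boost::asio (which host-error-monitor already uses). The `readUpiRetryCount()`
helper is hypothetical; a real implementation would issue the corresponding
PECI reads (e.g. via libpeci) and, on a change, append an entry to the fault
log described below.

```cpp
// Minimal polling sketch in the style of the asio-based host-error-monitor
// daemon. The PECI access is stubbed out: readUpiRetryCount() is a
// hypothetical helper, not a real libpeci call.
#include <boost/asio/io_context.hpp>
#include <boost/asio/steady_timer.hpp>
#include <boost/system/error_code.hpp>

#include <chrono>
#include <cstdint>
#include <iostream>
#include <optional>

static boost::asio::io_context io;
static boost::asio::steady_timer pollTimer(io);

// Placeholder for a PECI read of link-level retry statistics.
static std::optional<uint32_t> readUpiRetryCount()
{
    return 0; // stubbed value
}

static void pollLinkStats()
{
    if (auto count = readUpiRetryCount())
    {
        // A real monitor would compare against the previous value and, on a
        // change, add an entry to the fault log over D-Bus.
        std::cout << "UPI link retry count: " << *count << "\n";
    }

    pollTimer.expires_after(std::chrono::seconds(10));
    pollTimer.async_wait([](const boost::system::error_code& ec) {
        if (!ec)
        {
            pollLinkStats();
        }
    });
}

int main()
{
    pollLinkStats();
    io.run();
    return 0;
}
```
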
The fault monitor will not have its own dedicated OpenBMC repository, but will
consist of components incorporated into the existing repositories
host-error-monitor, bmcweb, and phosphor-debug-collector.

In the existing Host Error Monitor module, new monitors will be created to add
functionality needed for the fault monitor. For instance, based on the needs of
the OEM, the fault monitor will register to be notified of D-Bus signals of
interest in order to be alerted when fault events occur. The fault monitor will
also poll registers of interest and log their values to the fault log
(described in more detail below). In addition, the host will be able to write
fault information to the fault log (via a POST (Create) request to its
corresponding Redfish log resource collection). When the fault monitor becomes
aware of a new fault occurrence through any of these channels, it may add fault
information to the fault log. The fault monitor may also gather relevant sensor
data (read via D-Bus from the dbus-sensors services) and add it to the fault
log, with a reference to the original fault event information. The EventGroupID
in a Redfish LogEntry could potentially be used to associate multiple log
entries related to the same fault event.

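The following is a minimal sketch (using sdbusplus directly, outside the Host
Error Monitor's actual monitor framework) of the two interactions described
above: registering a D-Bus match so the monitor is alerted when a fault-related
property changes, and reading a sensor value published by a dbus-sensors
service. The watched interface, service name, and object paths are illustrative
placeholders.

```cpp
#include <sdbusplus/bus.hpp>
#include <sdbusplus/bus/match.hpp>

#include <exception>
#include <iostream>
#include <variant>

int main()
{
    auto bus = sdbusplus::bus::new_default();

    // Be alerted when a fault-related property changes. The watched object
    // path and interface are placeholders for whatever the OEM cares about.
    sdbusplus::bus::match_t faultMatch(
        bus,
        sdbusplus::bus::match::rules::propertiesChanged(
            "/xyz/openbmc_project/example/fault_source",
            "xyz.openbmc_project.Example.FaultStatus"),
        [](sdbusplus::message_t& msg) {
            std::cerr << "Fault-related property change from "
                      << msg.get_path() << "\n";
            // Here the monitor would gather data and append it, along with
            // related sensor readings, to the fault log.
        });

    // Read one sensor value from a dbus-sensors service (the service and
    // sensor path are examples; real paths come from the object mapper).
    auto method = bus.new_method_call(
        "xyz.openbmc_project.HwmonTempSensor",
        "/xyz/openbmc_project/sensors/temperature/example_temp",
        "org.freedesktop.DBus.Properties", "Get");
    method.append("xyz.openbmc_project.Sensor.Value", "Value");
    try
    {
        auto reply = bus.call(method);
        std::variant<double> value;
        reply.read(value);
        std::cerr << "Sensor reading: " << std::get<double>(value) << "\n";
    }
    catch (const std::exception& e)
    {
        std::cerr << "Sensor read failed: " << e.what() << "\n";
    }

    // Simple event loop; the real daemon integrates with boost::asio instead.
    while (true)
    {
        bus.process_discard();
        bus.wait();
    }
    return 0;
}
```
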
The fault log for storing relevant fault information (and exposing it to
external data center monitoring software) will be a new Redfish LogService
(/redfish/v1/Systems/system/LogServices/FaultLog) with
`OverwritePolicy=unknown`, in order to implement custom overwrite rules such as
prioritizing retaining first and/or more severe faults. The back-end
implementation of the fault log, including saving and managing log files, will
be added into the existing Phosphor Debug Collector repository with an
associated D-Bus object (e.g. xyz/openbmc_project/dump/faultlog) whose interface
will include methods for writing new data into the log, retrieving data from the
log, and clearing the log. The fault log will be implemented as a new dump type
in an existing Phosphor Debug Collector daemon (specifically the one whose
main() function is in dump_manager_main.cpp). The new fault log would contain
dump files that are collected in a variety of ways and formats. A new fault log
dump entry class (deriving from the "Entry" class in dump_entry.hpp) would be
defined with an additional "dump type" member variable to identify the type of
data that a fault log dump entry's corresponding dump file contains.

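To make the proposed entry class concrete, the sketch below shows only its
intended shape: a regular dump entry plus a "dump type" discriminator. The real
class would derive from the existing "Entry" class in dump_entry.hpp, whose
constructor arguments and D-Bus plumbing are omitted here in favor of a
simplified stand-in base; the type names below are illustrative only.

```cpp
#include <cstdint>
#include <filesystem>
#include <utility>

namespace faultlog_sketch
{

// Simplified stand-in for the existing dump entry base class
// (the "Entry" class in dump_entry.hpp).
class Entry
{
  public:
    Entry(uint32_t dumpId, uint64_t timestampUs, uint64_t sizeBytes,
          std::filesystem::path file) :
        dumpId(dumpId), timestampUs(timestampUs), sizeBytes(sizeBytes),
        file(std::move(file))
    {}
    virtual ~Entry() = default;

    uint32_t dumpId;
    uint64_t timestampUs;
    uint64_t sizeBytes;
    std::filesystem::path file;
};

// Identifies what kind of data a fault log entry's dump file contains.
enum class FaultDataType
{
    crashdump,        // e.g. a crash dump collected on CATERR
    hostProvided,     // data written by the host via a Redfish POST
    registerSnapshot, // register values polled by the Host Error Monitor
};

// Proposed fault log dump entry: the usual dump entry bookkeeping plus a
// type discriminator for the attached dump file.
class FaultLogEntry : public Entry
{
  public:
    FaultLogEntry(uint32_t dumpId, uint64_t timestampUs, uint64_t sizeBytes,
                  std::filesystem::path file, FaultDataType type) :
        Entry(dumpId, timestampUs, sizeBytes, std::move(file)), type(type)
    {}

    FaultDataType type;
};

} // namespace faultlog_sketch
```
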
bmcweb will be used as the associated Redfish webserver for external entities to
read and write the fault log. Functionality for handling a POST (Create) request
to a Redfish log resource collection will be added in bmcweb. When delivering a
Redfish fault log entry to a Redfish client, large-sized fault information (e.g.
crashdumps) can be specified as an attachment sub-resource (AdditionalDataURI)
instead of being inlined. Redfish events (EventService schema) will be used to
send external notifications, such as when the fault monitor needs to notify
external data center monitoring software of new fault information being
available. Redfish events may also be used to notify the host kernel and/or BIOS
of any repair actions that need to be triggered based on the latest fault
information.

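A rough sketch of how the new POST handler might be registered in bmcweb is
shown below, reusing bmcweb's existing routing macro. The route depth,
privilege set, and response details are assumptions for illustration; the
actual handler would validate the posted fault data and forward it over D-Bus
to the fault log manager in phosphor-debug-collector before replying.

```cpp
// Illustrative sketch only; intended to live in bmcweb's redfish-core tree.
#include <app.hpp>
#include <registries/privilege_registry.hpp>

#include <boost/beast/http/status.hpp>
#include <boost/beast/http/verb.hpp>

#include <memory>

namespace redfish
{

inline void requestRoutesFaultLogEntryCollectionPost(crow::App& app)
{
    BMCWEB_ROUTE(app,
                 "/redfish/v1/Systems/system/LogServices/FaultLog/Entries/")
        .privileges(redfish::privileges::postLogEntryCollection)
        .methods(boost::beast::http::verb::post)(
            [](const crow::Request& /*req*/,
               const std::shared_ptr<bmcweb::AsyncResp>& asyncResp) {
                // The D-Bus call that creates the new fault log entry from
                // the request body is omitted here.
                asyncResp->res.result(boost::beast::http::status::created);
                asyncResp->res.addHeader(
                    "Location",
                    "/redfish/v1/Systems/system/LogServices/FaultLog/"
                    "Entries/1");
            });
}

} // namespace redfish
```
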
## Alternatives Considered

We considered adding the fault logs into the main system event log
(/redfish/v1/Systems/system/LogServices/EventLog) or other logs already existing
in bmcweb (e.g. /redfish/v1/Systems/system/LogServices/Dump,
/redfish/v1/Managers/bmc/LogServices/Dump), but we would like to implement a
separate custom overwrite policy to ensure the most important information (such
as first errors and most severe errors) is retained for local analysis.

## Impacts

There may be situations where external consumers of fault monitor logs (e.g.
data center monitoring tools) are running software that is newer or older than
the version matching the BMC software running on a machine. In such cases,
consumers can ignore any types of fault information provided by the fault
monitor that they are not prepared to handle.

Errors are expected to happen infrequently, or to be throttled, so we expect
little to no performance impact.

## Testing

Error injection mechanisms or simulations may be used to artificially create
error conditions that will be logged by the fault monitor module.

There is no significant impact expected with regard to CI testing, but we do
intend to add unit testing for the fault monitor.
195