# Hardware Fault Monitor

Author:
  Claire Weinan (cweinan@google.com), daylight22

Other contributors:
  Heinz Boehmer Fiehn (heinzboehmer@google.com)
  Drew Walton (acwalton@google.com)

Created:
  Aug 5, 2021

## Problem Description

The goal is to create a new hardware fault monitor that will provide a
framework for collecting various fault and sensor information and making it
available externally via Redfish for data center monitoring and management
purposes. The information logged would include a wide variety of chipset
registers and data from manageability hardware. In addition to collecting
information through BMC interfaces, the hardware fault monitor will also
receive information via Redfish from the associated host kernel (specifically
for cases in which the desired information cannot be collected directly by the
BMC, for example when accessing registers that are read and cleared by the host
kernel).

Future expansion of the hardware fault monitor would include adding the means
to locally analyze fault and sensor information and then, based on specified
criteria, trigger repair actions in the host BIOS or kernel. In addition, the
hardware fault monitor could receive repair action requests via Redfish from
external data center monitoring software.

## Background and References

The following are a few related existing OpenBMC modules:

- Host Error Monitor logs CPU error information such as CATERR details and
  takes appropriate actions such as performing resets and collecting
  crashdumps: https://github.com/openbmc/host-error-monitor

- bmcweb implements a Redfish webserver for OpenBMC:
  https://github.com/openbmc/bmcweb. The Redfish LogService schema is available
  for logging purposes and the EventService schema is available for a Redfish
  server to send event notifications to clients.

- Phosphor Debug Collector (phosphor-debug-collector) collects various debug
  dumps and saves them into files:
  https://github.com/openbmc/phosphor-debug-collector

- dbus-sensors reads and saves sensor values and makes them available to other
  modules via D-Bus: https://github.com/openbmc/dbus-sensors

- SEL logger logs to the IPMI and Redfish system event logs when certain events
  happen, such as sensor readings going beyond their thresholds:
  https://github.com/openbmc/phosphor-sel-logger

- FRU fault manager controls the blinking of LEDs when faults occur:
  https://github.com/openbmc/phosphor-led-manager/blob/master/fault-monitor/fru-fault-monitor.hpp

- Guard On BMC records and manages a list of faulty components for isolation.
  (Both the host and the BMC may identify faulty components and create guard
  records for them):
  https://github.com/openbmc/docs/blob/9c79837a8a20dc8e131cc8f046d1ceb4a731391a/designs/guard-on-bmc.md

There is an OpenCompute Fault Management Infrastructure proposal that also
recommends delivering error logs from the BMC:
https://drive.google.com/file/d/1A9Qc7hB3THw0wiEK_dbXYj85_NOJWrb5/

## Requirements

- The users of this solution are Redfish clients in data center software. The
  goal of the fault monitor is to enable rich error logging (OEM and CPU vendor
  specific) for data center tools to monitor servers, manage repairs, predict
  crashes, etc.

- The fault monitor must be able to handle receiving fault information that is
  polled periodically as well as fault information that may come in
  sporadically based on fault incidents (e.g. crash dumps).

- The fault monitor should allow for logging of a variety of sizes of fault
  information entries (on the order of bytes to megabytes). In general, more
  severe errors which require more fault information to be collected tend to
  occur less frequently, while less severe errors such as correctable errors
  require less logging but may happen more frequently.

- Fault information must be added to a Redfish LogService in a timely manner
  (within a few seconds of the original event) to be available to external data
  center monitoring software.

- The fault monitor must allow for custom overwrite rules for its log entries
  (e.g. on overflow, save first errors and more severe errors), or guarantee
  that enough space is available in its log such that all data from the most
  recent couple of hours is always kept intact. The log does not have to be
  stored persistently (though it can be). One possible eviction rule is
  sketched below.

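As a minimal illustration of such an overwrite rule, the following sketch
evicts the most recently received entry among those with the lowest severity
whenever the log overflows, so that first-seen and more severe faults survive.
The `FaultEntry` fields are hypothetical and are not part of any existing
OpenBMC interface:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical in-memory view of a fault log entry (illustrative only).
struct FaultEntry
{
    uint64_t sequence; // arrival order (monotonically increasing)
    int severity;      // higher value = more severe
    size_t sizeBytes;  // size of the backing dump file
};

// On overflow, drop the newest entry among those with the lowest severity,
// preserving first errors and more severe errors.
void evictOne(std::vector<FaultEntry>& log)
{
    auto victim = std::min_element(
        log.begin(), log.end(),
        [](const FaultEntry& a, const FaultEntry& b) {
            // "Least valuable" = lowest severity; ties broken toward the
            // later (higher sequence) entry, so first errors are retained.
            return std::tie(a.severity, b.sequence) <
                   std::tie(b.severity, a.sequence);
        });
    if (victim != log.end())
    {
        log.erase(victim);
    }
}
```

A real implementation would likely also weigh entry size against remaining
storage, but the priority order (first and most severe faults last to go) is
the essential requirement.
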
## Proposed Design

A generic fault monitor will be created to collect fault information. First we
discuss a few example use cases:

- On CATERR, the Host Error Monitor requests a crash dump (this is an existing
  capability). The crash dump includes chipset registers but doesn't include
  platform-specific system-level data. The fault monitor would therefore
  additionally collect system-level data such as clock, thermal, and power
  information. This information would be bundled, logged, and associated with
  the crash dump so that it could be post-processed by data center monitoring
  tools without having to join multiple data sources.

- The fault monitor would monitor link-level retries and link retrainings of
  high-speed serial links such as UPI links. This isn't typically monitored by
  the host kernel at runtime, and the host kernel isn't able to log it during a
  crash. The fault monitor in the BMC could check link-level retries and link
  retrainings during runtime by polling over PECI (see the polling sketch after
  this list). If an MCERR or IERR occurred, the fault monitor could then add
  additional information such as high-speed serial link statistics to error
  logs.

- In order to monitor memory out of band, a system could be configured to give
  the BMC exclusive access to memory error logging registers (to prevent the
  host kernel from being able to access and clear the registers before the BMC
  could collect the register data). For corrected memory errors, the fault
  monitor could log error registers either through polling or interrupts. Data
  center monitoring tools would use the logs to determine whether memory should
  be swapped or a machine should be removed from usage.

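The periodic-polling portion of these use cases could follow the asynchronous
timer pattern already used in host-error-monitor (boost::asio). The sketch
below is illustrative only; `readLinkRetryCounters()` is a hypothetical
placeholder for the actual PECI register reads and fault log writes:

```cpp
#include <boost/asio/io_context.hpp>
#include <boost/asio/steady_timer.hpp>
#include <boost/system/error_code.hpp>

#include <chrono>

// Re-arming periodic poll. In the real monitor, the callback would read link
// retry/retraining counters over PECI and append them to the fault log.
void pollLinkCounters(boost::asio::steady_timer& timer)
{
    timer.expires_after(std::chrono::seconds(10)); // illustrative interval
    timer.async_wait([&timer](const boost::system::error_code& ec) {
        if (ec)
        {
            return; // timer was cancelled or failed; stop polling
        }
        // readLinkRetryCounters(); // hypothetical PECI access + logging
        pollLinkCounters(timer);    // schedule the next poll
    });
}

int main()
{
    boost::asio::io_context io;
    boost::asio::steady_timer timer(io);
    pollLinkCounters(timer);
    io.run(); // run until stopped
}
```
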
The fault monitor will not have its own dedicated OpenBMC repository, but will
consist of components incorporated into the existing repositories
host-error-monitor, bmcweb, and phosphor-debug-collector.

In the existing Host Error Monitor module, new monitors will be created to add
functionality needed for the fault monitor. For instance, based on the needs of
the OEM, the fault monitor will register to be notified of D-Bus signals of
interest in order to be alerted when fault events occur. The fault monitor will
also poll registers of interest and log their values to the fault log
(described in more detail later). In addition, the host will be able to write
fault information to the fault log (via a POST (Create) request to its
corresponding Redfish log resource collection). When the fault monitor becomes
aware of a new fault occurrence through any of these ways, it may add fault
information to the fault log. The fault monitor may also gather relevant sensor
data (read via D-Bus from the dbus-sensors services) and add it to the fault
log, with a reference to the original fault event information. The EventGroupId
in a Redfish LogEntry could potentially be used to associate multiple log
entries related to the same fault event.

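A rough sketch of registering for a D-Bus signal with sdbusplus (the D-Bus
library host-error-monitor already uses) is shown below. The interface and
member names are hypothetical placeholders; the actual match rules would
depend on which fault events the OEM needs to monitor:

```cpp
#include <sdbusplus/bus.hpp>
#include <sdbusplus/bus/match.hpp>
#include <sdbusplus/message.hpp>

#include <iostream>

int main()
{
    auto bus = sdbusplus::bus::new_default();

    namespace rules = sdbusplus::bus::match::rules;

    // Hypothetical match: wake up whenever some service emits a
    // "FaultDetected" signal on an example interface.
    sdbusplus::bus::match_t faultMatch(
        bus,
        rules::type::signal() +
            rules::interface("xyz.openbmc_project.Example") +
            rules::member("FaultDetected"),
        [](sdbusplus::message_t& msg) {
            // Here the monitor would gather related sensor data and append
            // an entry (or entries) to the fault log.
            std::cerr << "fault signal from " << msg.get_path() << '\n';
        });

    // Simple event loop; a real monitor would integrate with boost::asio.
    while (true)
    {
        bus.process_discard();
        bus.wait();
    }
}
```
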
The fault log for storing relevant fault information (and exposing it to
external data center monitoring software) will be a new Redfish LogService
(/redfish/v1/Systems/system/LogServices/FaultLog) with
`OverWritePolicy=Unknown`, in order to implement custom overwrite rules such as
prioritizing retaining first and/or more severe faults. The back-end
implementation of the fault log, including saving and managing log files, will
be added into the existing Phosphor Debug Collector repository, with an
associated D-Bus object (e.g. /xyz/openbmc_project/dump/faultlog) whose
interface will include methods for writing new data into the log, retrieving
data from the log, and clearing the log. The fault log will be implemented as a
new dump type in an existing Phosphor Debug Collector daemon (specifically the
one whose main() function is in dump_manager_main.cpp). The new fault log would
contain dump files that are collected in a variety of ways in a variety of
formats. A new fault log dump entry class (deriving from the "Entry" class in
dump_entry.hpp) would be defined with an additional "dump type" member variable
to identify the type of data that a fault log dump entry's corresponding dump
file contains.

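A condensed sketch of what such an entry class might look like follows. The
`Entry` class here is a stand-in for the real phosphor::dump::Entry base class
from dump_entry.hpp, whose constructor takes additional arguments (D-Bus
object path, dump ID, timestamps, size, status, parent manager) omitted here
for brevity; the `FaultDataType` values are examples only:

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Illustrative stand-in for phosphor::dump::Entry (dump_entry.hpp); the real
// base class also carries the D-Bus object, dump ID, timestamps, size, etc.
class Entry
{
  public:
    virtual ~Entry() = default;
};

// Example kinds of data a fault log dump file might contain.
enum class FaultDataType : uint8_t
{
    crashdump,         // chipset registers collected on CATERR/IERR/MCERR
    serialLinkStats,   // polled link retry/retraining counters
    correctedMemError, // corrected-memory-error register data
};

// Proposed fault log entry: an ordinary dump entry plus a "dump type" member
// identifying what the entry's backing file contains.
class FaultLogEntry : public Entry
{
  public:
    FaultLogEntry(FaultDataType type, std::string filePath) :
        type(type), filePath(std::move(filePath))
    {}

  private:
    FaultDataType type;   // the additional member proposed in this design
    std::string filePath; // location of the collected dump file
};
```
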
bmcweb will be used as the associated Redfish webserver for external entities
to read and write the fault log. Functionality for handling a POST (Create)
request to a Redfish log resource collection will be added in bmcweb. When
delivering a Redfish fault log entry to a Redfish client, large fault
information (e.g. crashdumps) can be specified as an attachment sub-resource
(AdditionalDataURI) instead of being inlined. Redfish events (EventService
schema) will be used to send external notifications, such as when the fault
monitor needs to notify external data center monitoring software of new fault
information being available. Redfish events may also be used to notify the host
kernel and/or BIOS of any repair actions that need to be triggered based on the
latest fault information.

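The new POST handler could follow bmcweb's existing BMCWEB_ROUTE registration
pattern, roughly as sketched below. Privileges, request validation, and the
D-Bus call into phosphor-debug-collector are elided, and the function and
route names are illustrative rather than merged code:

```cpp
// bmcweb-internal headers (paths as in the bmcweb source tree).
#include "app.hpp"
#include "async_resp.hpp"
#include "http_request.hpp"

#include <boost/beast/http/verb.hpp>

#include <memory>

inline void requestRoutesFaultLogEntryCollection(App& app)
{
    BMCWEB_ROUTE(app,
                 "/redfish/v1/Systems/system/LogServices/FaultLog/Entries/")
        .methods(boost::beast::http::verb::post)(
            [](const crow::Request& /*req*/,
               const std::shared_ptr<bmcweb::AsyncResp>& /*asyncResp*/) {
                // Parse the posted LogEntry, hand the payload to the fault
                // log's D-Bus interface in phosphor-debug-collector, and
                // reply with the Location of the new entry. Details omitted.
            });
}
```
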
## Alternatives Considered

We considered adding the fault logs into the main system event log
(/redfish/v1/Systems/system/LogServices/EventLog) or other logs already
existing in bmcweb (e.g. /redfish/v1/Systems/system/LogServices/Dump,
/redfish/v1/Managers/bmc/LogServices/Dump), but we would like to implement a
separate custom overwrite policy to ensure the most important information (such
as first errors and most severe errors) is retained for local analysis.

## Impacts

There may be situations where external consumers of fault monitor logs (e.g.
data center monitoring tools) are running software that is newer or older than
the version matching the BMC software running on a machine. In such cases,
consumers can ignore any types of fault information provided by the fault
monitor that they are not prepared to handle.

Errors are expected to happen infrequently, or to be throttled, so we expect
little to no performance impact.

## Testing

Error injection mechanisms or simulations may be used to artificially create
error conditions that will be logged by the fault monitor module.

There is no significant impact expected with regard to CI testing, but we do
intend to add unit testing for the fault monitor.
198