# Hardware Fault Monitor

Author: Claire Weinan (cweinan@google.com, daylight22)

Other contributors: Heinz Boehmer Fiehn (heinzboehmer@google.com), Drew Walton
(acwalton@google.com)

Created: Aug 5, 2021

## Problem Description

The goal is to create a new hardware fault monitor which will provide a
framework for collecting various fault and sensor information and making it
available externally via Redfish for data center monitoring and management
purposes. The information logged would include a wide variety of chipset
registers and data from manageability hardware. In addition to collecting
information through BMC interfaces, the hardware fault monitor will also receive
information via Redfish from the associated host kernel (specifically for cases
in which the desired information cannot be collected directly by the BMC, for
example when accessing registers that are read and cleared by the host kernel).

Future expansion of the hardware fault monitor would include adding the means to
locally analyze fault and sensor information and then based on specified
criteria trigger repair actions in the host BIOS or kernel. In addition, the
hardware fault monitor could receive repair action requests via Redfish from
external data center monitoring software.

## Background and References

The following are a few related existing OpenBMC modules:

- Host Error Monitor logs CPU error information such as CATERR details and takes
  appropriate actions such as performing resets and collecting crashdumps:
  https://github.com/openbmc/host-error-monitor

- bmcweb implements a Redfish webserver for openbmc:
  https://github.com/openbmc/bmcweb. The Redfish LogService schema is available
  for logging purposes and the EventService schema is available for a Redfish
  server to send event notifications to clients.

- Phosphor Debug Collector (phosphor-debug-collector) collects various debug
  dumps and saves them into files:
  https://github.com/openbmc/phosphor-debug-collector

- Dbus-sensors reads and saves sensor values and makes them available to other
  modules via D-Bus: https://github.com/openbmc/dbus-sensors

- SEL logger logs to the IPMI and Redfish system event logs when certain events
  happen, such as sensor readings going beyond their thresholds:
  https://github.com/openbmc/phosphor-sel-logger

- FRU fault manager controls the blinking of LEDs when faults occur:
  https://github.com/openbmc/phosphor-led-manager/blob/master/fault-monitor/fru-fault-monitor.hpp

- Guard On BMC records and manages a list of faulty components for isolation.
  (Both the host and the BMC may identify faulty components and create guard
  records for them):
  https://github.com/openbmc/docs/blob/9c79837a8a20dc8e131cc8f046d1ceb4a731391a/designs/guard-on-bmc.md

There is an OpenCompute Fault Management Infrastructure proposal that also
recommends delivering error logs from the BMC:
https://drive.google.com/file/d/1A9Qc7hB3THw0wiEK_dbXYj85_NOJWrb5/

## Requirements

- The users of this solution are Redfish clients in data center software. The
  goal of the fault monitor is to enable rich error logging (OEM and CPU vendor
  specific) for data center tools to monitor servers, manage repairs, predict
  crashes, etc.

- The fault monitor must be able to handle receiving fault information that is
  polled periodically as well as fault information that may come in sporadically
  based on fault incidents (e.g. crash dumps).

- The fault monitor should allow for logging of a variety of sizes of fault
  information entries (on the order of bytes to megabytes). In general, more
  severe errors which require more fault information to be collected tend to
  occur less frequently, while less severe errors such as correctable errors
  require less logging but may happen more frequently.

- Fault information must be added to a Redfish LogService in a timely manner
  (within a few seconds of the original event) to be available to external data
  center monitoring software.

- The fault monitor must allow for custom overwrite rules for its log entries
  (e.g. on overflow, save first errors and more severe errors), or guarantee
  that enough space is available in its log such that all data from the most
  recent couple of hours is always kept intact. The log does not have to be
  stored persistently (though it can be).

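The overwrite requirement in the last bullet can be illustrated with a small
sketch. This is not the planned implementation: the `FaultEntry` fields, the
numeric severity scale, and the rule that an entry may only displace strictly
lower-ranked entries are all assumptions chosen to show one way "save first
errors and more severe errors" could behave on overflow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaultEntry:
    entry_id: int           # assumed to increase monotonically over time
    severity: int           # hypothetical scale: higher means more severe
    size: int               # payload size in bytes
    is_first: bool = False  # first occurrence of this kind of fault


def rank(entry: FaultEntry) -> tuple:
    """Retention priority: first errors outrank others, then severity."""
    return (entry.is_first, entry.severity)


class FaultLog:
    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.entries: List[FaultEntry] = []

    def used(self) -> int:
        return sum(e.size for e in self.entries)

    def add(self, entry: FaultEntry) -> bool:
        """Store entry, evicting strictly lower-ranked entries if needed.

        Returns False (and leaves the log untouched) when room cannot be
        made without dropping an entry of equal or higher rank.
        """
        free = self.capacity - self.used()
        victims: List[FaultEntry] = []
        # Lowest rank first; among equals, newest first so older ones survive.
        for cand in sorted(self.entries, key=lambda e: (rank(e), -e.entry_id)):
            if free >= entry.size or rank(cand) >= rank(entry):
                break
            victims.append(cand)
            free += cand.size
        if free < entry.size:
            return False
        for victim in victims:
            self.entries.remove(victim)
        self.entries.append(entry)
        return True
```

Victims are chosen before anything is removed, so a rejected entry never costs
the log existing data.
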
## Proposed Design

A generic fault monitor will be created to collect fault information. First we
discuss a few example use cases:

- On CATERR, the Host Error Monitor requests a crash dump (this is an existing
  capability). The crash dump includes chipset registers but doesn’t include
  platform-specific system-level data. The fault monitor would therefore
  additionally collect system-level data such as clock, thermal, and power
  information. This information would be bundled, logged, and associated with
  the crash dump so that it could be post-processed by data center monitoring
  tools without having to join multiple data sources.

- The fault monitor would monitor link level retries and link retrainings of
  high speed serial links such as UPI links. This isn’t typically monitored by
  the host kernel at runtime and the host kernel isn’t able to log it during a
  crash. The fault monitor in the BMC could check link level retries and link
  retrainings during runtime by polling over PECI. If an MCERR or IERR occurred,
  the fault monitor could then add additional information such as high speed
  serial link statistics to error logs.

- In order to monitor memory out of band, a system could be configured to give
  the BMC exclusive access to memory error logging registers (to prevent the
  host kernel from being able to access and clear the registers before the BMC
  could collect the register data). For corrected memory errors, the fault
  monitor could log error registers either through polling or interrupts. Data
  center monitoring tools would use the logs to determine whether memory should
  be swapped or a machine should be removed from usage.

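The polling use cases above (link retries, corrected memory errors) share a
common shape: read a set of counters, compare against the previous reading, and
log only what changed. The sketch below shows that shape under stated
assumptions. The `read_counters` callable stands in for a real PECI or register
read, and the counter names are invented for illustration.

```python
from typing import Callable, Dict, List


def poll_once(read_counters: Callable[[], Dict[str, int]],
              last_seen: Dict[str, int]) -> List[dict]:
    """Read all counters once and return a record per counter that grew."""
    records = []
    for name, value in read_counters().items():
        previous = last_seen.get(name)
        last_seen[name] = value
        # No baseline yet, or the counter was reset: just record the value.
        if previous is None or value < previous:
            continue
        if value > previous:
            records.append(
                {"counter": name, "delta": value - previous, "total": value})
    return records


if __name__ == "__main__":
    # Stand-in for a PECI read; the counter names are made up.
    samples = iter([
        {"upi0_link_retries": 5, "upi0_retrains": 0},
        {"upi0_link_retries": 9, "upi0_retrains": 0},
    ])
    last: Dict[str, int] = {}
    for _ in range(2):
        print(poll_once(lambda: next(samples), last))
```
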
The fault monitor will not have its own dedicated OpenBMC repository, but will
consist of components incorporated into the existing repositories
host-error-monitor, bmcweb, and phosphor-debug-collector.

In the existing Host Error Monitor module, new monitors will be created to add
functionality needed for the fault monitor. For instance, based on the needs of
the OEM, the fault monitor will register to be notified of D-Bus signals of
interest in order to be alerted when fault events occur. The fault monitor will
also poll registers of interest and log their values to the fault log (described
in more detail below). In addition, the host will be able to write fault
information to the fault log (via a POST (Create) request to its corresponding
Redfish log resource collection). When the fault monitor becomes aware of a new
fault occurrence through any of these ways, it may add fault information to the
fault log. The fault monitor may also gather relevant sensor data (read via
D-Bus from the dbus-sensors services) and add it to the fault log, with a
reference to the original fault event information. The EventGroupId in a Redfish
LogEntry could potentially be used to associate multiple log entries related to
the same fault event.

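As a rough illustration of the grouping idea, the sketch below builds two
LogEntry request bodies that share one `EventGroupId`. The collection URI is an
assumption (the FaultLog service path from this design plus the standard
`Entries` collection name), and the `OemRecordFormat` values and messages are
hypothetical. Nothing is sent over the network; the bodies are only printed.

```python
import json
from typing import List

# Assumed collection URI: the FaultLog service path from this design plus the
# standard "Entries" collection name.
FAULT_LOG_ENTRIES = "/redfish/v1/Systems/system/LogServices/FaultLog/Entries"


def make_entry(message: str, record_format: str, group_id: int) -> dict:
    """Build a LogEntry request body for one piece of fault information."""
    return {
        "EntryType": "Oem",
        "OemRecordFormat": record_format,  # hypothetical format names
        "Message": message,
        "EventGroupId": group_id,          # same value for related entries
    }


def entries_for_fault(group_id: int) -> List[dict]:
    """Entries describing a single fault event, tied by a shared group ID."""
    return [
        make_entry("Host-reported fault data", "HostReported", group_id),
        make_entry("Sensor snapshot taken at fault time", "SensorSnapshot",
                   group_id),
    ]


if __name__ == "__main__":
    for body in entries_for_fault(group_id=17):
        print("POST", FAULT_LOG_ENTRIES)
        print(json.dumps(body, indent=2))
```
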
The fault log for storing relevant fault information (and exposing it to
external data center monitoring software) will be a new Redfish LogService
(/redfish/v1/Systems/system/LogServices/FaultLog) with
`OverwritePolicy=Unknown`, in order to implement custom overwrite rules such as
prioritizing retaining first and/or more severe faults. The back end
implementation of the fault log including saving and managing log files will be
added into the existing Phosphor Debug Collector repository with an associated
D-Bus object (e.g. /xyz/openbmc_project/dump/faultlog) whose interface will
include methods for writing new data into the log, retrieving data from the log,
and clearing the log. The fault log will be implemented as a new dump type in an
existing Phosphor Debug Collector daemon (specifically the one whose main()
function is in dump_manager_main.cpp). The new fault log would contain dump
files that are collected in a variety of ways in a variety of formats. A new
fault log dump entry class (deriving from the "Entry" class in dump_entry.hpp)
would be defined with an additional "dump type" member variable to identify the
type of data that a fault log dump entry's corresponding dump file contains.

bmcweb will be used as the associated Redfish webserver for external entities to
read and write the fault log. Functionality for handling a POST (Create) request
to a Redfish log resource collection will be added in bmcweb. When delivering a
Redfish fault log entry to a Redfish client, large-sized fault information (e.g.
crashdumps) can be specified as an attachment sub-resource (AdditionalDataURI)
instead of being inlined. Redfish events (EventService schema) will be used to
send external notifications, such as when the fault monitor needs to notify
external data center monitoring software of new fault information being
available. Redfish events may also be used to notify the host kernel and/or BIOS
of any repair actions that need to be triggered based on the latest fault
information.

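The inline-versus-attachment choice described above might look roughly like the
following. This is a sketch under assumptions: the size cut-off, the `Oem`
layout used for inline data, and the `/attachment` URI suffix are illustrative
choices, not something this design specifies.

```python
import base64

ENTRIES = "/redfish/v1/Systems/system/LogServices/FaultLog/Entries"
INLINE_LIMIT = 1024  # bytes; hypothetical cut-off


def to_log_entry(entry_id: str, data: bytes) -> dict:
    """Build a LogEntry, inlining small payloads and linking large ones."""
    entry = {
        "@odata.id": f"{ENTRIES}/{entry_id}",
        "Id": entry_id,
        "EntryType": "Oem",
    }
    if len(data) <= INLINE_LIMIT:
        # Small payloads travel inside the entry (hypothetical Oem layout).
        entry["Oem"] = {"FaultData": base64.b64encode(data).decode("ascii")}
    else:
        # Large payloads are fetched separately by the client.
        entry["AdditionalDataURI"] = f"{ENTRIES}/{entry_id}/attachment"
        entry["AdditionalDataSizeBytes"] = len(data)
    return entry
```

A client that only needs to triage entries can then list the collection cheaply
and download a crashdump attachment only when it decides to analyze it.
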
## Alternatives Considered

We considered adding the fault logs into the main system event log
(/redfish/v1/Systems/system/LogServices/EventLog) or other logs already existing
in bmcweb (e.g. /redfish/v1/Systems/system/LogServices/Dump,
/redfish/v1/Managers/bmc/LogServices/Dump), but we would like to implement a
separate custom overwrite policy to ensure the most important information (such
as first errors and most severe errors) is retained for local analysis.

## Impacts

There may be situations where external consumers of fault monitor logs (e.g.
data center monitoring tools) are running software that is newer or older than
the version matching the BMC software running on a machine. In such cases,
consumers can ignore any types of fault information provided by the fault
monitor that they are not prepared to handle.

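On the consumer side, ignoring unrecognized fault information can be as simple
as dispatching on an entry's type and skipping anything without a handler. The
sketch below assumes, purely for illustration, that the type is carried in the
LogEntry `OemRecordFormat` property and that the type names shown in a caller's
handler table are made up.

```python
from typing import Callable, Dict, List


def handle_entries(entries: List[dict],
                   handlers: Dict[str, Callable[[dict], None]]) -> List[str]:
    """Dispatch entries by type; return the types that were skipped."""
    skipped = []
    for entry in entries:
        kind = entry.get("OemRecordFormat", "")
        handler = handlers.get(kind)
        if handler is None:
            skipped.append(kind)  # unknown to this consumer version: ignore
            continue
        handler(entry)
    return skipped
```
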
Errors are expected to happen infrequently, or to be throttled, so we expect
little to no performance impact.

## Testing

Error injection mechanisms or simulations may be used to artificially create
error conditions that will be logged by the fault monitor module.

There is no significant impact expected with regard to CI testing, but we do
intend to add unit testing for the fault monitor.