# Fail Boot on Hardware Errors

Author: Andrew Geissler (geissonator)

Other contributors:

Created: Feb 20, 2020 Updated: Apr 12, 2022

## Problem Description

Some groups, for example a manufacturing team, have a requirement for the BMC
firmware to halt a system if an error log is created which calls out a piece of
hardware. The reason behind this is to ensure a system is not shipped to a
customer if it has any type of hardware issue. It also ensures that when an
error is found, it is identified quickly and all activity stops until the issue
is fixed. If the system has a hardware issue once shipped from manufacturing,
then the BMC firmware behavior should be to report the error, but allow the
system to continue to boot and operate.

OpenBMC firmware needs a mechanism to support this use case.

## Background and References

Within IBM, this function has been enabled/disabled by what are called
manufacturing flags. They were bits the user could set in registry variables
which the firmware would then query. These registry variables were only settable
by someone with admin authority to the system. These flags were not used outside
of manufacturing and test.

Extensions within phosphor-logging may process logs that do not always come
through the standard phosphor-logging interfaces (for example logs sent down by
the host). In these cases the system must still halt if those logs contain
hardware callouts.

[This][1] email thread on the topic was sent to the mailing list.

## Requirements

- Provide a mechanism to cause the OpenBMC firmware to halt a system if a
  phosphor-logging log is created with an inventory callout
  - The mechanism to enable/disable this feature does not need to be an external
    API (e.g. Redfish). It can simply be a busctl command run in an SSH session
    on the BMC
  - The halt must be obvious to the user when it occurs
    - The log which causes the halt must be identifiable
  - The halt must only stop the chassis/host instance that encountered the error
  - The halt must allow the host firmware the opportunity to gracefully shut
    itself down
  - The halt must stop the host (run obmc-host-stop@X.target) associated with
    the error and attempt to leave the system in the failed state (i.e. chassis
    power remains on if it is on)
  - The chassis/host instance pair will not be allowed to power on until the log
    that caused the halt is resolved or deleted
    - A BMC reset will clear this power on prevention
- Ensure the mechanism used to halt firmware on inventory callouts can also be
  utilized by phosphor-logging extensions to halt firmware for other causes
  - These causes will be defined within the extension's documentation
- Quiesce the associated host during this failure

**Special Note:** Initially the associated host and chassis will be hard coded
to chassis0 and host0. More work throughout the BMC stack is required to handle
multiple chassis and hosts. This design allows that type of feature to be
enabled at a later time.

## Proposed Design

Create a [phosphor-settingsd][2] setting,
`xyz.openbmc_project.Logging.Settings`. Within this setting, create a boolean
property called `QuiesceOnHwError`. This property will be hosted under the
`xyz.openbmc_project.Settings` service.
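
The requirements ask for nothing more than a busctl command to toggle the
feature. As a minimal sketch, the helper below builds (but does not run) such a
command line. The `busctl set-property` argument order (service, object,
interface, property, signature, value) is the standard one; the object path
`/xyz/openbmc_project/logging/settings` is an assumption, since this design
names only the interface and the service. The helper name is hypothetical.

```python
from typing import List

SETTINGS_SERVICE = "xyz.openbmc_project.Settings"
SETTINGS_INTERFACE = "xyz.openbmc_project.Logging.Settings"
# Assumed object path; the design text does not specify it.
SETTINGS_PATH = "/xyz/openbmc_project/logging/settings"


def quiesce_on_hw_error_command(enable: bool) -> List[str]:
    """Return the argv that sets QuiesceOnHwError through busctl."""
    return [
        "busctl",
        "set-property",
        SETTINGS_SERVICE,
        SETTINGS_PATH,
        SETTINGS_INTERFACE,
        "QuiesceOnHwError",
        "b",  # D-Bus type signature for a boolean
        "true" if enable else "false",
    ]


# Print the command a user could paste into a shell on the BMC.
print(" ".join(quiesce_on_hw_error_command(True)))
```

Building the argument list rather than a single string keeps each D-Bus
identifier separate, which makes the pieces easy to check against the interface
definition.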

Define a new D-Bus interface which will indicate an error has been created which
will prevent the boot of a chassis/host instance:
`xyz.openbmc_project.Logging.ErrorBlocksTransition`

This interface will be hosted under an instance-based D-Bus object
`/xyz/openbmc_project/logging/blockX` where X is the instance of the
chassis/host pair being blocked.

When an error is created via a phosphor-logging interface, the software will
check to see if the error has a callout, and if so it will check the new
`xyz.openbmc_project.Logging.Settings.QuiesceOnHwError` property. If this is
true then phosphor-logging will create a `/xyz/openbmc_project/logging/blockX`
D-Bus object with a `xyz.openbmc_project.Logging.ErrorBlocksTransition`
interface under it. A mapper [association][3] between the log and this new D-Bus
object will be created. The corresponding host instance will be put in quiesce
by phosphor-logging.
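
The decision flow above can be modeled without any D-Bus dependency. The sketch
below is illustrative only: `ErrorLog`, `BlockObject`, and `handle_new_log` are
hypothetical names, the log entry path layout is an assumption, and the
`associated_log` field merely stands in for the mapper association.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BLOCK_INTERFACE = "xyz.openbmc_project.Logging.ErrorBlocksTransition"


@dataclass
class ErrorLog:
    log_id: int
    # Inventory paths called out by the log; empty when there is no callout.
    callouts: List[str] = field(default_factory=list)

    @property
    def path(self) -> str:
        # Assumed entry path layout, used here only for illustration.
        return f"/xyz/openbmc_project/logging/entry/{self.log_id}"


@dataclass
class BlockObject:
    path: str
    interface: str
    associated_log: str  # stands in for the mapper association


def block_path(instance: int) -> str:
    """Object path for the block on a given chassis/host instance."""
    return f"/xyz/openbmc_project/logging/block{instance}"


def handle_new_log(
    log: ErrorLog, quiesce_on_hw_error: bool, instance: int = 0
) -> Optional[BlockObject]:
    """Return the block object to create, or None if the boot may continue."""
    if not log.callouts:
        return None  # no callout, so the setting is never consulted
    if not quiesce_on_hw_error:
        return None  # feature disabled: report the error but keep running
    return BlockObject(block_path(instance), BLOCK_INTERFACE, log.path)


hw_log = ErrorLog(1, ["/xyz/openbmc_project/inventory/system/chassis"])
print(handle_new_log(hw_log, quiesce_on_hw_error=True))
```

The instance defaults to 0 to mirror the initial hard coding to chassis0 and
host0 described in the requirements.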

The blocked state can be exited by rebooting the BMC or clearing the log
responsible for the blocking. Other system-specific policies could be placed in
the appropriate targets (for example if a chassis power off should clear the
block).
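
One way to picture the lifetime of a block is as purely in-memory state keyed by
instance. The sketch below assumes that model; `BlockTracker` and its methods
are hypothetical, and keeping the first blocking log when an instance is already
blocked is a choice made for this example, not something the design specifies.

```python
from typing import Dict


class BlockTracker:
    """In-memory record of which log blocks which chassis/host instance."""

    def __init__(self) -> None:
        # instance number -> id of the log responsible for the block
        self._blocks: Dict[int, int] = {}

    def block(self, instance: int, log_id: int) -> None:
        # Assumption: an already blocked instance keeps its original log.
        self._blocks.setdefault(instance, log_id)

    def on_log_cleared(self, log_id: int) -> None:
        """Drop every block owned by a log that was resolved or deleted."""
        self._blocks = {
            inst: owner for inst, owner in self._blocks.items() if owner != log_id
        }

    def power_on_allowed(self, instance: int) -> bool:
        return instance not in self._blocks


tracker = BlockTracker()
tracker.block(0, log_id=7)
print(tracker.power_on_allowed(0))  # blocked while log 7 exists
tracker.on_log_cleared(7)
print(tracker.power_on_allowed(0))  # allowed again once the log is cleared
```

Because nothing is persisted, a fresh `BlockTracker` after a BMC reboot starts
with no blocks, which matches the requirement that a BMC reset clears the power
on prevention.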

See the phosphor-logging [callout][4] design for more information on callouts.

A new `obmc-host-graceful-quiesce@.target` systemd target will be started. This
new target will ensure a graceful shutdown of the host is initiated and then
start the `obmc-host-quiesce@.target` which will stop the host and move the host
state to Quiesced.
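
Both targets are systemd template units, so the unit actually started carries
the host instance after the `@`. The helper below is a small sketch of that
naming; the function names are hypothetical, and the returned list only
describes the order in which the two targets are reached, since the second is
started by the first rather than by phosphor-logging.

```python
from typing import List

GRACEFUL_QUIESCE = "obmc-host-graceful-quiesce@.target"
QUIESCE = "obmc-host-quiesce@.target"


def instantiate(template: str, instance: int) -> str:
    """Turn a template unit name such as foo@.target into foo@N.target."""
    prefix, sep, suffix = template.partition("@")
    if not sep or not suffix.startswith("."):
        raise ValueError(f"not a template unit name: {template}")
    return f"{prefix}@{instance}{suffix}"


def quiesce_sequence(instance: int) -> List[str]:
    """Targets reached, in order, when a host instance is quiesced."""
    return [instantiate(GRACEFUL_QUIESCE, instance), instantiate(QUIESCE, instance)]


for unit in quiesce_sequence(0):
    print(unit)
```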

obmcutil will be enhanced to look for these block interfaces and notify the user
via the `obmcutil state` command if a block is enabled and what log is
associated with it.

The goal is to build upon this concept in future design work, allowing
developers to mark certain error logs as halting the system until the log is
handled.

## Host Errors

In certain scenarios, it is desirable to also halt the boot, and prevent it from
rebooting, when the host sends down certain errors to the BMC.

These errors may be of SEL format, or may be OEM specific, such as the [PEL
format][5] used by IBM.

The interfaces provided within phosphor-logging to handle the hardware callout
scenarios can be repurposed for this use case.

## Alternatives Considered

Currently this feature is a part of the base phosphor-logging design. If no one
other than IBM sees value, we could roll this into the PEL-specific portion of
phosphor-logging.

## Impacts

This will require some additional checking on reported logs but should have
minimal overhead.

There will be no changes to system behavior unless a user turns on this new
setting.

## Testing

Unit tests will be run to verify the logic that detects error logs with callouts
and to exercise both possible values of the new setting.

Test cases will need to look for this new blocking D-Bus object and handle it
appropriately.

[1]: https://lists.ozlabs.org/pipermail/openbmc/2020-February/020575.html
[2]: https://github.com/openbmc/phosphor-settingsd
[3]:
  https://github.com/openbmc/docs/blob/master/architecture/object-mapper.md#associations
[4]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Common/Callout/README.md
[5]:
  https://github.com/openbmc/phosphor-logging/blob/master/extensions/openpower-pels/README.md