# Fail Boot on Hardware Errors

Author: Andrew Geissler (geissonator)

Other contributors:

Created: Feb 20, 2020 Updated: Apr 12, 2022

## Problem Description

Some groups, for example a manufacturing team, require the BMC firmware to halt
a system if an error log is created that calls out a piece of hardware. This
ensures that a system with any type of hardware issue is not shipped to a
customer. It also ensures that when an error is found, it is identified quickly
and all activity stops until the issue is fixed. Once the system has shipped
from manufacturing, the BMC firmware should instead report a hardware error but
allow the system to continue to boot and operate.

OpenBMC firmware needs a mechanism to support this use case.

## Background and References

Within IBM, this function has been enabled and disabled through what are called
manufacturing flags. These were bits the user could set in registry variables,
which the firmware would then query. The registry variables were only settable
by someone with admin authority to the system. The flags were not used outside
of manufacturing and test.

Extensions within phosphor-logging may process logs that do not always come
through the standard phosphor-logging interfaces (for example, logs sent down
by the host). In these cases the system must still halt if those logs contain
hardware callouts.

[This][1] email thread on the topic was sent to the mailing list.

## Requirements

- Provide a mechanism to cause the OpenBMC firmware to halt a system if a
  phosphor-logging log is created with an inventory callout
  - The mechanism to enable/disable this feature does not need to be an external
    API (e.g. Redfish). It can simply be a busctl command one runs in an ssh
    session to the BMC
  - The halt must be obvious to the user when it occurs
    - The log which causes the halt must be identifiable
  - The halt must only stop the chassis/host instance that encountered the error
  - The halt must allow the host firmware the opportunity to gracefully shut
    itself down
  - The halt must stop the host (run obmc-host-stop@X.target) associated with
    the error and attempt to leave the system in the fail state (i.e. chassis
    power remains on if it is on)
  - The chassis/host instance pair will not be allowed to power on until the log
    that caused the halt is resolved or deleted
    - A BMC reset will clear this power on prevention
- Ensure the mechanism used to halt firmware on inventory callouts can also be
  utilized by phosphor-logging extensions to halt firmware for other causes
  - These causes will be defined within the extension's documentation
- Quiesce the associated host during this failure

**Special Note:** Initially the associated host and chassis will be hard coded
to chassis0 and host0. More work throughout the BMC stack is required to handle
multiple chassis and hosts. This design allows that type of feature to be
enabled at a later time.

## Proposed Design

Create a [phosphor-settingsd][2] setting,
`xyz.openbmc_project.Logging.Settings`. Within this create a boolean property
called `QuiesceOnHwError`. This property will be hosted under the
`xyz.openbmc_project.Settings` service.
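
As a rough illustration of how the setting could be toggled from an ssh session,
the sketch below builds a `busctl set-property` command line for the property.
The service, interface, and property names come from this design; the object
path is an assumption made for the example, since this document does not name
it, and the helper function is hypothetical.

```python
import shlex
from typing import List

SETTINGS_SERVICE = "xyz.openbmc_project.Settings"
SETTINGS_INTERFACE = "xyz.openbmc_project.Logging.Settings"
# Assumption: this design does not state the object path; illustrative only.
SETTINGS_PATH = "/xyz/openbmc_project/logging/settings"


def quiesce_on_hw_error_command(enable: bool) -> List[str]:
    """Build the busctl argv that sets QuiesceOnHwError to the given value."""
    return [
        "busctl",
        "set-property",
        SETTINGS_SERVICE,
        SETTINGS_PATH,
        SETTINGS_INTERFACE,
        "QuiesceOnHwError",
        "b",  # D-Bus type signature for a boolean
        "true" if enable else "false",
    ]


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell on the BMC.
    print(shlex.join(quiesce_on_hw_error_command(True)))
```

Returning an argument list rather than a single string keeps the sketch usable
with `subprocess.run` without shell quoting concerns.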

Define a new D-Bus interface which will indicate an error has been created which
will prevent the boot of a chassis/host instance:
`xyz.openbmc_project.Logging.ErrorBlocksTransition`

This interface will be hosted under an instance-based D-Bus object
`/xyz/openbmc_project/logging/blockX` where X is the instance of the
chassis/host pair being blocked.

When an error is created via a phosphor-logging interface, the software will
check to see if the error has a callout, and if so it will check the new
`xyz.openbmc_project.Logging.Settings.QuiesceOnHwError` property. If this is
true then phosphor-logging will create a `/xyz/openbmc_project/logging/blockX`
D-Bus object with a `xyz.openbmc_project.Logging.ErrorBlocksTransition`
interface under it. A mapper [association][3] between the log and this new D-Bus
object will be created. The corresponding host instance will be put in quiesce
by phosphor-logging.
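
The decision flow described above can be modeled in a few lines. This is only a
sketch of the logic, not phosphor-logging code: the `ErrorLog` and
`LoggingModel` classes are hypothetical, a plain dictionary stands in for the
mapper association, and a set stands in for quiescing the host.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

BLOCK_INTERFACE = "xyz.openbmc_project.Logging.ErrorBlocksTransition"


@dataclass
class ErrorLog:
    """Hypothetical stand-in for a phosphor-logging entry."""

    log_id: int
    callouts: List[str] = field(default_factory=list)  # inventory paths


@dataclass
class LoggingModel:
    """Toy model of the check performed when a log is created."""

    quiesce_on_hw_error: bool = False
    # block object path -> IDs of associated logs (models the association)
    blocks: Dict[str, List[int]] = field(default_factory=dict)
    quiesced_hosts: Set[int] = field(default_factory=set)

    def on_log_created(self, log: ErrorLog, instance: int = 0) -> Optional[str]:
        """Return the block object path if the log blocks the host, else None."""
        if not log.callouts:
            return None  # no hardware callout, nothing to do
        if not self.quiesce_on_hw_error:
            return None  # feature disabled: the error is only reported
        path = f"/xyz/openbmc_project/logging/block{instance}"
        self.blocks.setdefault(path, []).append(log.log_id)
        self.quiesced_hosts.add(instance)  # models putting the host in quiesce
        return path
```

The instance defaults to 0 in this sketch to mirror the initial hard coding of
chassis0 and host0 noted in the requirements.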

The blocked state can be exited by rebooting the BMC or clearing the log
responsible for the blocking. Other system-specific policies could be placed in
the appropriate targets (for example, if a chassis power off should clear the
block).
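
The lifecycle of a block can be sketched as a small state tracker. The
`BlockTracker` class and its method names are hypothetical and exist only to
illustrate the rules stated in this design: a blocked instance may not power on
until every blocking log is resolved or deleted, and a BMC reboot clears all
blocks.

```python
from typing import Dict, Set


class BlockTracker:
    """Toy model of power-on prevention per chassis/host instance."""

    def __init__(self) -> None:
        # instance number -> IDs of the logs currently blocking it
        self._blocks: Dict[int, Set[int]] = {}

    def add_block(self, instance: int, log_id: int) -> None:
        """Record that a log is blocking the given instance."""
        self._blocks.setdefault(instance, set()).add(log_id)

    def log_resolved_or_deleted(self, log_id: int) -> None:
        """Drop the log from every instance; remove blocks left empty."""
        for instance in list(self._blocks):
            self._blocks[instance].discard(log_id)
            if not self._blocks[instance]:
                del self._blocks[instance]

    def bmc_reboot(self) -> None:
        """A BMC reset clears the power-on prevention."""
        self._blocks.clear()

    def power_on_allowed(self, instance: int) -> bool:
        """True when no log is blocking the instance."""
        return instance not in self._blocks
```

A system-specific policy, such as clearing the block on chassis power off, would
be one more method on this model.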

See the phosphor-logging [callout][4] design for more information on callouts.

A new `obmc-host-graceful-quiesce@.target` systemd target will be started. This
new target will ensure a graceful shutdown of the host is initiated and then
start the `obmc-host-quiesce@.target` which will stop the host and move the host
state to Quiesced.
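
Both targets are systemd template units, so the host instance number is filled
in after the `@` when they are started. The helpers below are a hypothetical
sketch of that naming and of the order described above; they only build strings
and do not talk to systemd.

```python
from typing import List

GRACEFUL_QUIESCE_TEMPLATE = "obmc-host-graceful-quiesce@.target"
QUIESCE_TEMPLATE = "obmc-host-quiesce@.target"


def instantiate(template: str, instance: int) -> str:
    """Turn a systemd template unit name into an instance unit name."""
    prefix, sep, suffix = template.partition("@")
    if not sep:
        raise ValueError(f"{template!r} is not a template unit name")
    return f"{prefix}@{instance}{suffix}"


def quiesce_sequence(instance: int) -> List[str]:
    """Units in the order this design describes: graceful first, then quiesce."""
    return [
        instantiate(GRACEFUL_QUIESCE_TEMPLATE, instance),
        instantiate(QUIESCE_TEMPLATE, instance),
    ]


def start_command(instance: int) -> List[str]:
    """argv for starting the graceful quiesce target; it starts the second."""
    return ["systemctl", "start", quiesce_sequence(instance)[0]]
```

Only the first unit needs to be started explicitly in this sketch, because the
design has the graceful target start the quiesce target itself.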

`obmcutil` will be enhanced to look for these block interfaces and notify the
user via the `obmcutil state` command if a block is enabled and which log is
associated with it.

The goal is to build upon this concept when future design work is done to allow
developers to associate certain error logs with causing a halt to the system
until a log is handled.

## Host Errors

In certain scenarios, it is desirable to also halt the boot, and prevent it from
rebooting, when the host sends down certain errors to the BMC.

These errors may be of SEL format, or may be OEM specific, such as the [PEL
format][5] used by IBM.

The interfaces provided within phosphor-logging to handle the hardware callout
scenarios can be repurposed for this use case.

## Alternatives Considered

Currently this feature is a part of the base phosphor-logging design. If no one
other than IBM sees value, we could roll this into the PEL-specific portion of
phosphor-logging.

## Impacts

This will require some additional checking on reported logs but should have
minimal overhead.

There will be no changes to system behavior unless a user turns on this new
setting.

## Testing

Unit tests will verify the logic that detects logs with hardware callouts and
will exercise both possible values of the new setting.

Test cases will need to look for this new blocking D-Bus object and handle it
appropriately.

[1]: https://lists.ozlabs.org/pipermail/openbmc/2020-February/020575.html
[2]: https://github.com/openbmc/phosphor-settingsd
[3]:
  https://github.com/openbmc/docs/blob/master/architecture/object-mapper.md#associations
[4]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Common/Callout/README.md
[5]:
  https://github.com/openbmc/phosphor-logging/blob/master/extensions/openpower-pels/README.md