# Fail Boot on Hardware Errors

Author: Andrew Geissler (geissonator)

Primary assignee: Andrew Geissler (geissonator)

Other contributors:

Created: Feb 20, 2020

## Problem Description
Some groups, for example a manufacturing team, have a requirement for the BMC
firmware to halt a system if an error log is created which calls out a piece of
hardware. The reason behind this is to ensure a system is not shipped to a
customer if it has any type of hardware issue. It also ensures that when an
error is found, it is identified quickly and all activity stops until the issue
is fixed. If the system has a hardware issue once shipped from manufacturing,
then the BMC firmware behavior should be to report the error, but allow the
system to continue to boot and operate.

OpenBMC firmware needs a mechanism to support this use case.

## Background and References
Within IBM, this function has been enabled/disabled by what are called
manufacturing flags. These were bits the user could set in registry variables
which the firmware would then query. These registry variables were only
settable by someone with admin authority to the system. These flags were not
used outside of manufacturing and test.

Extensions within phosphor-logging may process logs that do not always come
through the standard phosphor-logging interfaces (for example logs sent
down by the host). In these cases the system must still halt if those logs
contain hardware callouts.

[This][1] email thread on the topic was sent to the OpenBMC mailing list.

## Requirements
- Provide a mechanism to cause the OpenBMC firmware to halt a system if a
  phosphor-logging log is created with an inventory callout
  - The mechanism to enable/disable this feature does not need to be an
    external API (e.g. Redfish). It can simply be a busctl command one runs
    in an ssh session to the BMC
  - The halt must be obvious to the user when it occurs
    - The log which causes the halt must be identifiable
  - The halt must only stop the chassis/host instance that encountered the error
  - The halt must stop the host (run obmc-host-stop@X.target) associated with
    the error and attempt to leave the system in the fail state (i.e. chassis
    power remains on if it is on)
  - The chassis/host instance pair will not be allowed to power on until
    the log that caused the halt is resolved or deleted
    - A BMC reset will clear this power on prevention
- Quiesce the associated host during this failure

**Special Note:** Initially the associated host and chassis will be hard-coded to
chassis0 and host0. More work throughout the BMC stack is required to handle
multiple chassis and hosts. This design allows that type of feature to be
enabled at a later time.

## Proposed Design
Create a [phosphor-settingsd][2] setting,
`xyz.openbmc_project.Logging.Settings`. Within this, create a boolean property
called `QuiesceOnHwError`. This property will be hosted under the
`xyz.openbmc_project.Settings` service.
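
As a rough illustration of how the setting could be toggled from an ssh
session, the sketch below builds the `busctl` command lines for writing and
reading the property. The service, interface, and property names come from this
design; the object path `/xyz/openbmc_project/logging/settings` is an
assumption, since the design does not name it, and the helper function names
are hypothetical.

```python
import shlex
from typing import List

# Names taken from the design text above.
SETTINGS_SERVICE = "xyz.openbmc_project.Settings"
SETTINGS_INTERFACE = "xyz.openbmc_project.Logging.Settings"
SETTINGS_PROPERTY = "QuiesceOnHwError"

# Assumption: the design does not state the object path of the setting.
SETTINGS_PATH = "/xyz/openbmc_project/logging/settings"


def set_quiesce_command(enabled: bool) -> List[str]:
    """Return the busctl argv that writes the boolean property."""
    return [
        "busctl", "set-property",
        SETTINGS_SERVICE, SETTINGS_PATH, SETTINGS_INTERFACE,
        SETTINGS_PROPERTY,
        "b",  # D-Bus type signature for a boolean
        "true" if enabled else "false",
    ]


def get_quiesce_command() -> List[str]:
    """Return the busctl argv that reads the property back."""
    return [
        "busctl", "get-property",
        SETTINGS_SERVICE, SETTINGS_PATH, SETTINGS_INTERFACE,
        SETTINGS_PROPERTY,
    ]


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch is safe
    # to execute on a machine that is not a BMC.
    print(shlex.join(set_quiesce_command(True)))
    print(shlex.join(get_quiesce_command()))
```

Printing the command instead of executing it keeps the example runnable
anywhere; on a BMC the printed line could be pasted into the shell directly.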

Define a new D-Bus interface which will indicate an error has been created which
will prevent the boot of a chassis/host instance:
`xyz.openbmc_project.Logging.ErrorBlocksTransition`

This interface will be hosted under an instance-based D-Bus object
`/xyz/openbmc_project/logging/blockX` where X is the instance of the
chassis/host pair being blocked.

When an error is created via a phosphor-logging interface, the software will
check to see if the error has a callout, and if so it will check the new
`xyz.openbmc_project.Logging.Settings.QuiesceOnHwError` setting. If this is
true then phosphor-logging will create a `/xyz/openbmc_project/logging/blockX`
D-Bus object with a `xyz.openbmc_project.Logging.ErrorBlocksTransition`
interface under it. A mapper [association][3] between the log and this new
D-Bus object will be created. The corresponding host instance will be put
in quiesce by phosphor-logging.
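
The decision flow described above can be sketched as a small in-memory model.
This is not the phosphor-logging implementation: `LogEntry`, `LoggingModel`,
and `on_log_created` are hypothetical names, the association is reduced to a
pair of paths, and quiescing the host is reduced to recording the instance
number. Only the block object path and the interface name come from this
design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Interface name taken from the design text.
BLOCK_INTERFACE = "xyz.openbmc_project.Logging.ErrorBlocksTransition"


@dataclass
class LogEntry:
    """Hypothetical stand-in for a created error log."""
    path: str
    callouts: List[str] = field(default_factory=list)


@dataclass
class LoggingModel:
    """Hypothetical in-memory stand-in for phosphor-logging state."""
    quiesce_on_hw_error: bool = False
    # block object path -> interface hosted on it
    blocks: Dict[str, str] = field(default_factory=dict)
    # (log path, block path) pairs standing in for mapper associations
    associations: List[Tuple[str, str]] = field(default_factory=list)
    quiesced_hosts: List[int] = field(default_factory=list)

    def on_log_created(self, log: LogEntry, instance: int = 0) -> Optional[str]:
        # Step 1: only logs that carry a callout are candidates.
        if not log.callouts:
            return None
        # Step 2: the feature is opt-in through the setting.
        if not self.quiesce_on_hw_error:
            return None
        # Step 3: create the block object for this chassis/host instance.
        block_path = f"/xyz/openbmc_project/logging/block{instance}"
        self.blocks[block_path] = BLOCK_INTERFACE
        # Step 4: associate the log with the block object.
        self.associations.append((log.path, block_path))
        # Step 5: put the corresponding host instance in quiesce.
        if instance not in self.quiesced_hosts:
            self.quiesced_hosts.append(instance)
        return block_path
```

With the setting off, or with a log that has no callout, `on_log_created`
returns `None` and leaves the model untouched, which mirrors the requirement
that system behavior is unchanged unless the setting is enabled.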

The blocked state can be exited by rebooting the BMC or clearing the log
responsible for the blocking. Other system-specific policies could be placed
in the appropriate targets (for example if a chassis power off should clear
the block).
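
A minimal sketch of the two exit paths follows, assuming the block state is
held only in memory so that a BMC reboot starts with no blocks. `BlockTracker`
and its methods are hypothetical, and tracking a single log per block is a
simplification made for the example.

```python
from typing import Dict


class BlockTracker:
    """Hypothetical tracker mapping block object paths to the blocking log."""

    def __init__(self) -> None:
        # Held in memory only, so a BMC reboot naturally clears it.
        self._log_for_block: Dict[str, str] = {}

    @staticmethod
    def block_path(instance: int) -> str:
        return f"/xyz/openbmc_project/logging/block{instance}"

    def block(self, instance: int, log_path: str) -> str:
        """Record that log_path blocks the given chassis/host instance."""
        path = self.block_path(instance)
        # Simplification: keep the first log that caused the block.
        self._log_for_block.setdefault(path, log_path)
        return path

    def log_cleared(self, log_path: str) -> None:
        """Drop any block whose log was resolved or deleted."""
        self._log_for_block = {
            block: log
            for block, log in self._log_for_block.items()
            if log != log_path
        }

    def bmc_reboot(self) -> None:
        """Model a BMC reboot: all in-memory block state is lost."""
        self._log_for_block.clear()

    def power_on_allowed(self, instance: int) -> bool:
        return self.block_path(instance) not in self._log_for_block
```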

See the phosphor-logging [callout][4] design for more information on callouts.

The appropriate `obmc-host-stop@.target` instance will also be started when
`obmc-bmc-quiesce.target` is started. This ensures the host is stopped as soon as
the error is discovered.

obmcutil will be enhanced to look for these block interfaces and notify the
user via the `obmcutil state` command if a block is enabled and what log
is associated with it.
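
To make the reporting idea concrete, the sketch below formats the kind of
message a state command could show for each active block. It is illustrative
only: `describe_blocks` is a hypothetical helper, the message wording is
invented for the example, and the input is assumed to be a mapping from block
object path to the associated log path that the caller has already looked up.

```python
from typing import Dict, List


def describe_blocks(blocks: Dict[str, str]) -> List[str]:
    """Return one human-readable line per active block.

    blocks maps a block object path to the path of the log associated
    with it. The wording of the lines is an assumption for illustration.
    """
    if not blocks:
        return ["No blocking errors found"]
    return [
        f"Boot blocked by {block}; associated log: {log}"
        for block, log in sorted(blocks.items())
    ]


if __name__ == "__main__":
    example = {"/xyz/openbmc_project/logging/block0": "/example/log/1"}
    for line in describe_blocks(example):
        print(line)
```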

The goal is to build upon this concept when future design work is done to allow
developers to associate certain error logs with causing a halt to the system
until a log is handled.

## Alternatives Considered
Currently this feature is a part of the base phosphor-logging design. If no
one other than IBM sees value, we could roll this into the PEL-specific
portion of phosphor-logging.

A systemd target could be created to do the host stop and quiesce (and any
other system-specific things people need) but at this point there doesn't
seem to be a ton of value in it. It could always be added later if needed.

## Impacts
This will require some additional checking on reported logs but should have
minimal overhead.

There will be no changes to system behavior unless a user turns on this new
setting.

## Testing
Unit tests will be run to verify the logic that detects logs with callouts,
covering both possible values of the new setting.

Test cases will need to look for this new blocking D-Bus object and handle it
appropriately.

[1]: https://lists.ozlabs.org/pipermail/openbmc/2020-February/020575.html
[2]: https://github.com/openbmc/phosphor-settingsd
[3]: https://github.com/openbmc/docs/blob/master/architecture/object-mapper.md#associations
[4]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/xyz/openbmc_project/Common/Callout/README.md