# Fail Boot on Hardware Errors

Author: Andrew Geissler (geissonator)

Other contributors:

Created: Feb 20, 2020
Updated: Apr 12, 2022

## Problem Description
Some groups, for example a manufacturing team, require the BMC firmware to
halt a system if an error log is created which calls out a piece of hardware.
The reason behind this is to ensure a system is not shipped to a customer if it
has any type of hardware issue. It also ensures that when an error is found, it
is identified quickly and all activity stops until the issue is fixed. If the
system has a hardware issue once shipped from manufacturing, then the BMC
firmware should report the error but allow the system to continue to boot and
operate.

OpenBMC firmware needs a mechanism to support this use case.

## Background and References
Within IBM, this function has been enabled/disabled by what are called
manufacturing flags. They were bits the user could set in registry variables
which the firmware would then query. These registry variables were only
settable by someone with admin authority to the system. These flags were not
used outside of manufacturing and test.

Extensions within phosphor-logging may process logs that do not always come
through the standard phosphor-logging interfaces (for example logs sent
down by the host). In these cases the system must still halt if those logs
contain hardware callouts.

[This email thread][1] on the topic was sent to the mailing list.

## Requirements
- Provide a mechanism to cause the OpenBMC firmware to halt a system if a
  phosphor-logging log is created with an inventory callout
  - The mechanism to enable/disable this feature does not need to be an
    external API (i.e. Redfish).
    It can simply be a busctl command one runs
    in an SSH session on the BMC
  - The halt must be obvious to the user when it occurs
  - The log which causes the halt must be identifiable
  - The halt must only stop the chassis/host instance that encountered the error
  - The halt must allow the host firmware the opportunity to gracefully shut
    itself down
  - The halt must stop the host (run obmc-host-stop@X.target) associated with
    the error and attempt to leave the system in the fail state (i.e. chassis
    power remains on if it is on)
  - The chassis/host instance pair will not be allowed to power on until
    the log that caused the halt is resolved or deleted
    - A BMC reset will clear this power on prevention
- Ensure the mechanism used to halt firmware on inventory callouts can also be
  utilized by phosphor-logging extensions to halt firmware for other causes
  - These causes will be defined within the extensions' documentation
- Quiesce the associated host during this failure

**Special Note:** Initially the associated host and chassis will be hard coded to
chassis0 and host0. More work throughout the BMC stack is required to handle
multiple chassis and hosts. This design allows that type of feature to be
enabled at a later time.

## Proposed Design
Create a [phosphor-settingsd][2] setting,
`xyz.openbmc_project.Logging.Settings`. Within this, create a boolean property
called `QuiesceOnHwError`. This property will be hosted under the
`xyz.openbmc_project.Settings` service.

Define a new D-Bus interface which will indicate an error has been created which
will prevent the boot of a chassis/host instance:
`xyz.openbmc_project.Logging.ErrorBlocksTransition`

This interface will be hosted under an instance-based D-Bus object
`/xyz/openbmc_project/logging/blockX` where X is the instance of the
chassis/host pair being blocked.

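
The naming scheme for the block object can be illustrated with a minimal
Python sketch. Only the path pattern and the interface name come from this
design; the helper names `block_object_path` and `instance_from_block_path`
are hypothetical and exist purely for illustration.

```python
import re

# Taken from the design text: the block object path pattern and the
# interface that marks a blocked chassis/host instance.
BLOCK_PATH_PREFIX = "/xyz/openbmc_project/logging/block"
BLOCK_INTERFACE = "xyz.openbmc_project.Logging.ErrorBlocksTransition"

# Hypothetical pattern used to parse an instance number back out of a path.
_BLOCK_PATH_RE = re.compile(re.escape(BLOCK_PATH_PREFIX) + r"([0-9]+)")


def block_object_path(instance: int) -> str:
    """Return the block object path for a chassis/host instance."""
    if instance < 0:
        raise ValueError("instance must be non-negative")
    return f"{BLOCK_PATH_PREFIX}{instance}"


def instance_from_block_path(path: str) -> int:
    """Recover the chassis/host instance number from a block object path."""
    match = _BLOCK_PATH_RE.fullmatch(path)
    if match is None:
        raise ValueError(f"not a block object path: {path!r}")
    return int(match.group(1))
```

With the initial hard-coded instance, `block_object_path(0)` yields
`/xyz/openbmc_project/logging/block0`. A tool that lists block objects could
use the reverse helper to work out which chassis/host pair is affected.
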

When an error is created via a phosphor-logging interface, the software will
check to see if the error has a callout, and if so it will check the new
`xyz.openbmc_project.Logging.Settings.QuiesceOnHwError` property. If this is
true then phosphor-logging will create a `/xyz/openbmc_project/logging/blockX`
D-Bus object with an `xyz.openbmc_project.Logging.ErrorBlocksTransition`
interface under it. A mapper [association][3] between the log and this new
D-Bus object will be created. phosphor-logging will then put the corresponding
host instance into quiesce.

The blocked state can be exited by rebooting the BMC or clearing the log
responsible for the blocking. Other system-specific policies could be placed
in the appropriate targets (for example if a chassis power off should clear
the block).

See the phosphor-logging [callout][4] design for more information on callouts.

A new `obmc-host-graceful-quiesce@.target` systemd target will be started.
This new target will ensure a graceful shutdown of the host is initiated
and then start the `obmc-host-quiesce@.target` which will stop the host
and move the host state to Quiesced.

obmcutil will be enhanced to look for these block interfaces and notify the
user via the `obmcutil state` command if a block is enabled and what log
is associated with it.

The goal is to build upon this concept when future design work is done to allow
developers to associate certain error logs with causing a halt to the system
until a log is handled.

## Host Errors

In certain scenarios, it is desirable to also halt the boot, and prevent it
from rebooting, when the host sends down certain errors to the BMC.

These errors may be of SEL format, or may be OEM specific, such as the
[PEL format][5] used by IBM.

The interfaces provided within phosphor-logging to handle the hardware callout
scenarios can be repurposed for this use case.

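
The blocking flow from the Proposed Design, and the way an extension could
reuse it for host-sent errors, can be sketched as a small in-memory Python
model. This is an illustration under stated assumptions, not the
phosphor-logging implementation: `ErrorLog`, `LoggingModel`, and all method
names are hypothetical, and the real code would use D-Bus objects, mapper
associations, and systemd rather than dictionaries and lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ErrorLog:
    """Hypothetical stand-in for a phosphor-logging entry."""
    entry_id: int
    callouts: List[str] = field(default_factory=list)


@dataclass
class LoggingModel:
    """Toy model of the blocking behavior described in this design."""
    quiesce_on_hw_error: bool = False  # mirrors the QuiesceOnHwError setting
    # Block object path -> ids of the logs associated with that block.
    blocks: Dict[str, Set[int]] = field(default_factory=dict)
    started_targets: List[str] = field(default_factory=list)

    @staticmethod
    def block_path(instance: int) -> str:
        return f"/xyz/openbmc_project/logging/block{instance}"

    def block(self, log: ErrorLog, instance: int = 0) -> str:
        """Create the block, record the log association, quiesce the host.

        Assumption: an extension handling a host-sent error would call this
        entry point directly for its own halt causes.
        """
        path = self.block_path(instance)
        self.blocks.setdefault(path, set()).add(log.entry_id)
        target = f"obmc-host-graceful-quiesce@{instance}.target"
        if target not in self.started_targets:
            self.started_targets.append(target)
        return path

    def on_log_created(self, log: ErrorLog, instance: int = 0) -> Optional[str]:
        """Block only when the log has a callout and the setting is true."""
        if log.callouts and self.quiesce_on_hw_error:
            return self.block(log, instance)
        return None

    def on_log_deleted(self, entry_id: int) -> None:
        """Clearing the responsible log removes the block."""
        for path in list(self.blocks):
            self.blocks[path].discard(entry_id)
            if not self.blocks[path]:
                del self.blocks[path]

    def on_bmc_reset(self) -> None:
        """A BMC reset clears the power on prevention."""
        self.blocks.clear()

    def may_power_on(self, instance: int = 0) -> bool:
        return self.block_path(instance) not in self.blocks
```

In this model a log with a callout has no effect while `quiesce_on_hw_error`
is false, which matches the statement that system behavior is unchanged unless
the setting is turned on. Whether extension-initiated halts should also
consult the setting is not specified here, so the sketch leaves that decision
to the caller of `block`.
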

## Alternatives Considered
Currently this feature is a part of the base phosphor-logging design. If no
one other than IBM sees value, we could roll this into the PEL-specific
portion of phosphor-logging.

## Impacts
This will require some additional checking on reported logs but should have
minimal overhead.

There will be no changes to system behavior unless a user turns on this new
setting.

## Testing
Unit tests will verify the logic that detects qualifying error logs and will
cover both possible values of the new setting.

Test cases will need to look for this new blocking D-Bus object and handle it
appropriately.

[1]: https://lists.ozlabs.org/pipermail/openbmc/2020-February/020575.html
[2]: https://github.com/openbmc/phosphor-settingsd
[3]: https://github.com/openbmc/docs/blob/master/architecture/object-mapper.md#associations
[4]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Common/Callout/README.md
[5]: https://github.com/openbmc/phosphor-logging/blob/master/extensions/openpower-pels/README.md