# Fail Boot on Hardware Errors

Author: Andrew Geissler (geissonator)

Primary assignee: Andrew Geissler (geissonator)

Other contributors:

Created: Feb 20, 2020
Updated: Apr 12, 2022

## Problem Description
Some groups, for example a manufacturing team, have a requirement for the BMC
firmware to halt a system if an error log is created which calls out a piece of
hardware. The reason behind this is to ensure a system is not shipped to a
customer if it has any type of hardware issue. It also ensures that when an
error is found, it is identified quickly and all activity stops until the issue
is fixed. If the system has a hardware issue once shipped from manufacturing,
then the BMC firmware behavior should be to report the error, but allow the
system to continue to boot and operate.

OpenBMC firmware needs a mechanism to support this use case.

## Background and References
Within IBM, this function has been enabled/disabled by what are called
manufacturing flags. They were bits the user could set in registry variables
which the firmware would then query. These registry variables were only
settable by someone with admin authority to the system. These flags were not
used outside of manufacturing and test.

Extensions within phosphor-logging may process logs that do not always come
through the standard phosphor-logging interfaces (for example logs sent
down by the host). In these cases the system must still halt if those logs
contain hardware callouts.

[This][1] email thread was sent on this topic to the list.

## Requirements
- Provide a mechanism to cause the OpenBMC firmware to halt a system if a
  phosphor-logging log is created with an inventory callout
  - The mechanism to enable/disable this feature does not need to be an
    external API (i.e. Redfish).
    It can simply be a busctl command one runs
    in an ssh to the BMC
  - The halt must be obvious to the user when it occurs
  - The log which causes the halt must be identifiable
  - The halt must only stop the chassis/host instance that encountered the error
  - The halt must allow the host firmware the opportunity to gracefully shut
    itself down
  - The halt must stop the host (run obmc-host-stop@X.target) associated with
    the error and attempt to leave the system in the fail state (i.e. chassis
    power remains on if it is on)
  - The chassis/host instance pair will not be allowed to power on until
    the log that caused the halt is resolved or deleted
    - A BMC reset will clear this power on prevention
- Ensure the mechanism used to halt firmware on inventory callouts can also be
  utilized by phosphor-logging extensions to halt firmware for other causes
  - These causes will be defined within the extensions' documentation
- Quiesce the associated host during this failure

**Special Note:** Initially the associated host and chassis will be hard coded to
chassis0 and host0. More work throughout the BMC stack is required to handle
multiple chassis and hosts. This design allows that type of feature to be
enabled at a later time.

## Proposed Design
Create a [phosphor-settingsd][2] setting,
`xyz.openbmc_project.Logging.Settings`. Within this create a boolean property
called `QuiesceOnHwError`. This property will be hosted under the
`xyz.openbmc_project.Settings` service.

Define a new D-Bus interface which will indicate an error has been created which
will prevent the boot of a chassis/host instance:
`xyz.openbmc_project.Logging.ErrorBlocksTransition`

This interface will be hosted under an instance-based D-Bus object
`/xyz/openbmc_project/logging/blockX` where X is the instance of the
chassis/host pair being blocked.
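
As a rough illustration of the naming above, the following Python sketch builds
the block object path for a given instance and assembles the arguments of a
`busctl set-property` call that could toggle the setting. The settings object
path (`SETTINGS_PATH`) and the helper names `block_object_path` and
`set_quiesce_command` are assumptions made for this example; the design does
not specify them.

```python
# Names taken from the design text.
SETTINGS_SERVICE = "xyz.openbmc_project.Settings"
SETTINGS_INTERFACE = "xyz.openbmc_project.Logging.Settings"
SETTINGS_PROPERTY = "QuiesceOnHwError"
BLOCK_INTERFACE = "xyz.openbmc_project.Logging.ErrorBlocksTransition"

# ASSUMPTION: object path that hosts the setting; the design does not name it.
SETTINGS_PATH = "/xyz/openbmc_project/logging/settings"


def block_object_path(instance: int) -> str:
    """Return the blockX object path for a chassis/host instance."""
    if instance < 0:
        raise ValueError("instance must be non-negative")
    return f"/xyz/openbmc_project/logging/block{instance}"


def set_quiesce_command(enabled: bool) -> list:
    """Build the busctl argv that would set the boolean property."""
    return [
        "busctl", "set-property",
        SETTINGS_SERVICE, SETTINGS_PATH, SETTINGS_INTERFACE,
        SETTINGS_PROPERTY, "b", "true" if enabled else "false",
    ]


if __name__ == "__main__":
    # Initially only instance 0 (chassis0/host0) is supported.
    print(block_object_path(0))
    print(" ".join(set_quiesce_command(True)))
```

The command is only assembled here, not executed, so the sketch can be run off
target to check the names before trying it in an ssh session on the BMC.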

When an error is created via a phosphor-logging interface, the software will
check to see if the error has a callout, and if so it will check the new
`xyz.openbmc_project.Logging.Settings.QuiesceOnHwError`. If this is true then
phosphor-logging will create a `/xyz/openbmc_project/logging/blockX` D-Bus
object with a `xyz.openbmc_project.Logging.ErrorBlocksTransition` interface
under it. A mapper [association][3] between the log and this new D-Bus
object will be created. The corresponding host instance will be put
in quiesce by phosphor-logging.

The blocked state can be exited by rebooting the BMC or clearing the log
responsible for the blocking. Other system specific policies could be placed
in the appropriate targets (for example if a chassis power off should clear
the block).

See the phosphor-logging [callout][4] design for more information on callouts.

A new `obmc-host-graceful-quiesce@.target` systemd target will be started.
This new target will ensure a graceful shutdown of the host is initiated
and then start the `obmc-host-quiesce@.target` which will stop the host
and move the host state to Quiesced.

obmcutil will be enhanced to look for these block interfaces and notify the
user via the `obmcutil state` command if a block is enabled and what log
is associated with it.

The goal is to build upon this concept when future design work is done to allow
developers to associate certain error logs with causing a halt to the system
until a log is handled.

## Host Errors

In certain scenarios, it is desirable to also halt the boot, and prevent it
from rebooting, when the host sends down certain errors to the BMC.

These errors may be of SEL format, or may be OEM specific, such as the
[PEL format][5] used by IBM.

The interfaces provided within phosphor-logging to handle the hardware callout
scenarios can be repurposed for this use case.
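
The decision flow from the Proposed Design, extended to host-sent errors, can
be modeled in a few lines of Python. This is an in-memory sketch only and does
not touch D-Bus or systemd. The class and method names (`ErrorLog`,
`LoggingModel`, `power_on_allowed`) and the `extension_requests_halt` flag are
hypothetical. It also assumes that halts requested by an extension are gated by
the same `QuiesceOnHwError` setting, which this design does not state
explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorLog:
    log_id: int
    has_callout: bool = False
    # HYPOTHETICAL: set by an extension (e.g. for a host-sent error)
    # that wants the same halt behavior without an inventory callout.
    extension_requests_halt: bool = False


@dataclass
class LoggingModel:
    quiesce_on_hw_error: bool = False  # models the QuiesceOnHwError setting
    # log id -> blocked chassis/host instance (stands in for the blockX
    # object plus its association to the log)
    blocking_logs: dict = field(default_factory=dict)

    def create_log(self, log: ErrorLog, instance: int = 0) -> bool:
        """Record the log; return True if it blocks the instance."""
        wants_halt = log.has_callout or log.extension_requests_halt
        # ASSUMPTION: the setting gates both kinds of halt request.
        if wants_halt and self.quiesce_on_hw_error:
            self.blocking_logs[log.log_id] = instance
            return True
        return False

    def delete_log(self, log_id: int) -> None:
        """Clearing the responsible log removes its block."""
        self.blocking_logs.pop(log_id, None)

    def bmc_reboot(self) -> None:
        """A BMC reset clears every power on prevention."""
        self.blocking_logs.clear()

    def power_on_allowed(self, instance: int = 0) -> bool:
        return instance not in self.blocking_logs.values()


if __name__ == "__main__":
    model = LoggingModel(quiesce_on_hw_error=True)
    model.create_log(ErrorLog(1, has_callout=True))
    print(model.power_on_allowed())  # False: log 1 blocks instance 0
    model.delete_log(1)
    print(model.power_on_allowed())  # True: the blocking log was cleared
```

With the setting left at its default of false, `create_log` never records a
block, which matches the expectation that system behavior does not change
unless a user turns the setting on.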

## Alternatives Considered
Currently this feature is a part of the base phosphor-logging design. If no
one other than IBM sees value, we could roll this into the PEL-specific
portion of phosphor-logging.

## Impacts
This will require some additional checking on reported logs but should have
minimal overhead.

There will be no changes to system behavior unless a user turns on this new
setting.

## Testing
Unit tests will exercise the logic that detects error logs with callouts and
will cover both possible values of the new setting.

Test cases will need to look for this new blocking D-Bus object and handle it
appropriately.


[1]: https://lists.ozlabs.org/pipermail/openbmc/2020-February/020575.html
[2]: https://github.com/openbmc/phosphor-settingsd
[3]: https://github.com/openbmc/docs/blob/master/architecture/object-mapper.md#associations
[4]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Common/Callout/README.md
[5]: https://github.com/openbmc/phosphor-logging/blob/master/extensions/openpower-pels/README.md