# Fail Boot on Hardware Errors

Author: Andrew Geissler (geissonator)

Other contributors:

Created: Feb 20, 2020 Updated: Apr 12, 2022

## Problem Description

Some groups, for example a manufacturing team, have a requirement for the BMC
firmware to halt a system if an error log is created which calls out a piece of
hardware. The reason behind this is to ensure a system is not shipped to a
customer if it has any type of hardware issue. It also ensures when an error is
found, it is identified quickly and all activity stops until the issue is fixed.
If the system has a hardware issue once shipped from manufacturing, then the BMC
firmware behavior should be to report the error, but allow the system to
continue to boot and operate.

OpenBMC firmware needs a mechanism to support this use case.

## Background and References

Within IBM, this function has been enabled/disabled by what is called
manufacturing flags. They were bits the user could set in registry variables
which the firmware would then query. These registry variables were only settable
by someone with admin authority to the system. These flags were not used outside
of manufacturing and test.


Extensions within phosphor-logging may process logs that do not always come
through the standard phosphor-logging interfaces (for example logs sent down by
the host). In these cases the system must still halt if those logs contain
hardware callouts.

[This][1] email thread on the topic was sent to the mailing list.

## Requirements

- Provide a mechanism to cause the OpenBMC firmware to halt a system if a
  phosphor-logging log is created with an inventory callout
  - The mechanism to enable/disable this feature does not need to be an external
    API (e.g. Redfish). It can simply be a busctl command run in an ssh session
    on the BMC
  - The halt must be obvious to the user when it occurs
  - The log which causes the halt must be identifiable
  - The halt must only stop the chassis/host instance that encountered the error
  - The halt must allow the host firmware the opportunity to gracefully shut
    itself down
  - The halt must stop the host (run `obmc-host-stop@X.target`) associated with
    the error and attempt to leave the system in the fail state (i.e. chassis
    power remains on if it is on)
  - The chassis/host instance pair will not be allowed to power on until the log
    that caused the halt is resolved or deleted
    - A BMC reset will clear this power-on prevention
- Ensure the mechanism used to halt firmware on inventory callouts can also be
  utilized by phosphor-logging extensions to halt firmware for other causes
  - These causes will be defined within the extensions' documentation
- Quiesce the associated host during this failure

**Special Note:** Initially the associated host and chassis will be hard coded
to chassis0 and host0. More work throughout the BMC stack is required to handle
multiple chassis and hosts. This design allows that type of feature to be
enabled at a later time.

## Proposed Design

Create a [phosphor-settingsd][2] setting,
`xyz.openbmc_project.Logging.Settings`. Within this create a boolean property
called `QuiesceOnHwError`. This property will be hosted under the
`xyz.openbmc_project.Settings` service.

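
As a rough illustration of how a manufacturing user might toggle this setting
from an ssh session, the sketch below builds the corresponding
`busctl set-property` invocation. The service, interface, and property names
come from the text above. The object path is an assumption (the design does not
name it), and the helper function names are hypothetical.

```python
import subprocess

# Names taken from the design text above.
SETTINGS_SERVICE = "xyz.openbmc_project.Settings"
SETTINGS_INTERFACE = "xyz.openbmc_project.Logging.Settings"
SETTINGS_PROPERTY = "QuiesceOnHwError"

# ASSUMPTION: the design does not state the object path; this is illustrative.
SETTINGS_PATH = "/xyz/openbmc_project/logging/settings"


def quiesce_on_hw_error_cmd(enabled: bool) -> list[str]:
    """Build the busctl argv that sets the boolean property ("b" = D-Bus bool)."""
    return [
        "busctl", "set-property",
        SETTINGS_SERVICE, SETTINGS_PATH, SETTINGS_INTERFACE, SETTINGS_PROPERTY,
        "b", "true" if enabled else "false",
    ]


def set_quiesce_on_hw_error(enabled: bool) -> None:
    """Run the command; only meaningful on a BMC where busctl and the service exist."""
    subprocess.run(quiesce_on_hw_error_cmd(enabled), check=True)


if __name__ == "__main__":
    # Print the equivalent shell command instead of executing it.
    print(" ".join(quiesce_on_hw_error_cmd(True)))
```

Printing the command rather than running it keeps the sketch usable off-target.
On a BMC, the printed line could be pasted directly into the shell.
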

Define a new D-Bus interface which indicates that an error has been created that
prevents the boot of a chassis/host instance:
`xyz.openbmc_project.Logging.ErrorBlocksTransition`

This interface will be hosted under an instance-based D-Bus object
`/xyz/openbmc_project/logging/blockX` where X is the instance of the
chassis/host pair being blocked.

When an error is created via a phosphor-logging interface, the software will
check to see if the error has a callout, and if so it will check the new
`xyz.openbmc_project.Logging.Settings.QuiesceOnHwError` property. If this is
true then phosphor-logging will create a `/xyz/openbmc_project/logging/blockX`
D-Bus object with a `xyz.openbmc_project.Logging.ErrorBlocksTransition`
interface under it. A mapper [association][3] between the log and this new D-Bus
object will be created. The corresponding host instance will be put in quiesce
by phosphor-logging.

The blocked state can be exited by rebooting the BMC or clearing the log
responsible for the blocking. Other system-specific policies could be placed in
the appropriate targets (for example if a chassis power off should clear the
block).

See the phosphor-logging [callout][4] design for more information on callouts.

A new `obmc-host-graceful-quiesce@.target` systemd target will be started. This
new target will ensure a graceful shutdown of the host is initiated and then
start the `obmc-host-quiesce@.target` which will stop the host and move the host
state to Quiesced.

obmcutil will be enhanced to look for these block interfaces and notify the user
via the `obmcutil state` command if a block is enabled and what log is
associated with it.

The goal is to build upon this concept when future design work is done to allow
developers to associate certain error logs with causing a halt to the system
until a log is handled.

## Host Errors

In certain scenarios, it is desirable to also halt the boot, and prevent it from
rebooting, when the host sends down certain errors to the BMC.

These errors may be of SEL format, or may be OEM specific, such as the [PEL
format][5] used by IBM.

The interfaces provided within phosphor-logging to handle the hardware callout
scenarios can be repurposed for this use case.

## Alternatives Considered

Currently this feature is a part of the base phosphor-logging design. If no one
other than IBM sees value, we could roll this into the PEL-specific portion of
phosphor-logging.


## Impacts

This will require some additional checking on reported logs but should have
minimal overhead.

There will be no changes to system behavior unless a user turns on this new
setting.

## Testing

Unit tests will exercise the logic that detects logs with callouts and verify
both possible values of the new setting.

Test cases will need to look for this new blocking D-Bus object and handle it
appropriately.

[1]: https://lists.ozlabs.org/pipermail/openbmc/2020-February/020575.html
[2]: https://github.com/openbmc/phosphor-settingsd
[3]:
  https://github.com/openbmc/docs/blob/master/architecture/object-mapper.md#associations
[4]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Common/Callout/README.md
[5]:
  https://github.com/openbmc/phosphor-logging/blob/master/extensions/openpower-pels/README.md