# OpenBMC Server Power Recovery

Author: Andrew Geissler (geissonator)

Primary assignee: Andrew Geissler (geissonator)

Other contributors:

Created: October 11th, 2021

## Problem Description
Modern computer systems have a feature, automated power-on recovery, which
in essence is the ability to tell your system what to do when it hits
issues with power to the system. If the system had a blackout (i.e. power
was completely cut to the system), should it automatically power the system
on? Should it leave it off? Or maybe the user would like the system to
go to whichever state it was in before the power loss.

There are also instances where the user may not want automatic power recovery
to occur. For example, some systems have op-panels, and on these op-panels
there can be a pin hole reset. This is a manual mechanism for the user to
force a hard reset to the BMC in situations where it is hung or not responding.
In these situations, the user may wish for the system to not automatically
power on, because they want to debug the reason for the BMC error.

A brownout is another scenario that commonly utilizes automated power-on
recovery features. A brownout is a scenario where BMC firmware detects (or is
told) that chassis power can no longer be supported, but power to the BMC
will be retained. On some systems, it's desired to utilize the automated
power-on feature to turn chassis power back on as soon as the brownout condition
ends.

Some system owners may choose to attach an Uninterruptible Power Supply (UPS) to
their system. A UPS continues to provide power to a system through a blackout
or brownout scenario. A UPS has a limited amount of power so its main
purpose is to handle brief power interruptions or to allow for an orderly
shutdown of the host firmware.

The goal of this design document is to describe how OpenBMC firmware will
deal with these questions.

## Background and References
The BMC already implements a limited subset of function in this area.
The [PowerRestorePolicy][pdi-restore] property in phosphor-dbus-interfaces
defines the supported policies.

In smaller servers, this feature is commonly found within the Advanced
Configuration and Power Interface (ACPI).

[openbmc/phosphor-state-manager][state-mgr] supports this property as defined
in phosphor-dbus-interfaces.

## Requirements

### Automated Power-On Recovery
OpenBMC software must ensure it persists the state of power to the chassis so
it knows what to restore it to if necessary.

OpenBMC software must provide support for the following options:
- Do nothing when power is lost to the system (this will be the
  default)
- Always power the system on and boot the host
- Always power the system off (previous power was on, power is now off, run
  all chassis power off services to ensure a clean state of software and
  hardware)
- Restore the previous state of the chassis power and host

These options are only checked and enforced when the BMC comes out of reboot
and does not detect that chassis power is already on to the system.

OpenBMC software must also support the concept of a one_time power restore
policy. This is a separate instance of the `PowerRestorePolicy` which will
be hosted under a D-Bus object path which ends with "one_time". If this
one_time setting is not the default, `None`, then software will execute
the policy defined under it, and then reset the one_time property to `None`.
This one_time feature is a way for software to utilize the automated power-on
recovery function for other areas like firmware update scenarios where a
certain power on behavior is desired once an update has completed.
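
The options and the chassis-power check above can be summarized as a small
decision function. The following is a minimal sketch, not the
phosphor-state-manager implementation: the names `RestorePolicy`, `Action`,
and `decide_action` are hypothetical, the enum values mirror the policy names
used in this document, and doing nothing for `Restore` when the chassis was
previously off is an assumption.

```python
from enum import Enum


class RestorePolicy(Enum):
    """Hypothetical mirror of the policy names used in this document."""
    NONE = "None"              # do nothing (default)
    ALWAYS_ON = "AlwaysOn"     # power on and boot the host
    ALWAYS_OFF = "AlwaysOff"   # run the chassis power off services
    RESTORE = "Restore"        # return to the persisted previous state


class Action(Enum):
    NOTHING = "nothing"
    POWER_ON = "power-on"
    POWER_OFF = "power-off"


def decide_action(policy, chassis_on_now, chassis_was_on):
    """Return the recovery action for one BMC boot."""
    # The policy is only enforced when chassis power is not already on.
    if chassis_on_now:
        return Action.NOTHING
    if policy is RestorePolicy.ALWAYS_ON:
        return Action.POWER_ON
    if policy is RestorePolicy.ALWAYS_OFF:
        return Action.POWER_OFF
    if policy is RestorePolicy.RESTORE and chassis_was_on:
        return Action.POWER_ON
    # Policy is None, or Restore with the chassis previously off
    # (assumption: power is already off, so nothing needs to be done).
    return Action.NOTHING
```

Keeping the decision a pure function of the policy and the persisted state
makes each option easy to verify without real hardware.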

### BMC and System Recovery Paths
In situations where the BMC or the system has gotten into a bad state, and
the user has initiated some form of manual reset which is detectable by the
BMC as being user initiated, the BMC software must:
- Fill in the appropriate `RebootCause` within the [BMC state interface][bmc-state]
  - At a minimum, `PinholeReset` will be added. Others can be added as needed
- Log an error indicating a user initiated forced reset has occurred
- Not log an error indicating a blackout has occurred if chassis power was on
  prior to the pin hole reset
- Not implement any power recovery policy on the system
- Turn power recovery back on once the BMC has a normal reboot

### Brownout
As noted above, a brownout condition is when AC power cannot continue to be
supplied to the chassis, but the BMC can continue to have power and run.

When this condition occurs, the BMC must:
- Power the system off as quickly as the situation requires (or gracefully
  handle the loss of power if it occurred without warning)
- Log an error indicating the brownout event has occurred
- Support the ability for host firmware to indicate a one-time power restore
  policy, if it wishes, for when the brownout completes
- Identify when a brownout condition has completed
- Wait for the brownout to complete and implement the one-time power restore
  policy.
  If no one-time policy is defined then run the standard power restore
  policy defined for the system

BMC firmware must also be able to:
- Discover if the system is in a brownout situation
  - Run when the BMC first comes up to know if it should implement any automated
    power-on recovery
- Not run any power-on recovery logic when a brownout is occurring
- Tell the host firmware that it is an automated power-on recovery initiated
  boot when that firmware is what boots the system

### Uninterruptible Power Supply (UPS)
When a UPS is present and a blackout or brownout condition occurs, the BMC must:
- Log an error to indicate the condition has occurred
- If host firmware is running, notify the host firmware of this utility failure
  condition (this behavior is build-time configurable)
- If the UPS battery power becomes low and if host firmware is running, notify
  the host firmware of the condition, indicating a quick power off is required
  (this behavior is build-time configurable)
- Log an error if the UPS battery power becomes low and a power loss to the
  entire system is imminent (i.e. a blackout scenario where the BMC will also
  lose power and the UPS is about to run out of power)
- Not execute any automated power-on recovery logic to prevent power on/off
  thrashing (this behavior is build-time configurable)

## Proposed Design

### Automated Power-On Recovery
An application will run after the chassis and host states have been
determined, and only if the chassis power is not on.

This application will look for the one_time setting and use it if its value
is not `None`. If it does use the one_time setting then it will reset it
to `None` once it has read it. Otherwise the application will read the
persistent value of the `PowerRestorePolicy`. The application will then
run the logic as defined in the Requirements above.
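
The one_time lookup described above can be sketched as follows. This is an
illustration under stated assumptions rather than the actual application: a
plain dictionary stands in for the two D-Bus settings objects, and the name
`select_policy` and the keys `one_time` and `persistent` are hypothetical.

```python
def select_policy(settings):
    """Pick the policy for this boot, consuming one_time if it is set.

    `settings` is a hypothetical stand-in for the two settings objects:
    settings["one_time"] and settings["persistent"] hold policy names.
    """
    one_time = settings.get("one_time", "None")
    if one_time != "None":
        # The one_time value is used exactly once, then reset to the default.
        settings["one_time"] = "None"
        return one_time
    # No one_time override: fall back to the persistent policy.
    return settings.get("persistent", "None")


# Example: a one_time policy left behind by a firmware update.
settings = {"one_time": "AlwaysOn", "persistent": "AlwaysOff"}
first = select_policy(settings)   # one_time is used and then cleared
second = select_policy(settings)  # persistent policy is used from now on
```

Resetting the one_time value as soon as it is read means a second power loss
falls back to the persistent policy.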

This functionality will be hosted in phosphor-state-manager and potentially
x86-power-control.

### BMC and System Recovery Paths
The BMC state manager application currently looks at a file in
sysfs to try and determine the cause of a BMC reboot. It then puts this
reason in the `RebootCause` property.

One possible cause of a BMC reset is an external reset (EXTRST). There are
a variety of reasons an external reset can occur. Some systems are adding
GPIOs to provide additional detail on these types of resets.

A new GPIO name will be added to the [device-tree-gpio-naming.md][dev-tree]
which reports whether a pin hole reset has occurred on the previous reboot of
the BMC. The BMC state manager application will enhance its support of the
`RebootCause` to look for this GPIO and, if present, read it and set
`RebootCause` accordingly when it either cannot determine the reason for
the reboot via sysfs or sysfs reports an EXTRST reason (in which case
the GPIO will be utilized to enhance the reboot reason).

If the power recovery software sees the `PinholeReset` reason within the
`RebootCause` then it will not implement any of its policy. Future BMC
reboots that are not caused by a pin hole reset will cause `RebootCause` to go
back to a default, and therefore power recovery policy will be reenabled on
that BMC boot.

The phosphor-state-manager chassis software will not log a blackout error
if it sees the `PinholeReset` reason (or any other reason that indicates a user
initiated a reset of the system).

### Brownout
The existing `xyz.openbmc_project.State.Chassis` interface will be enhanced to
support a `CurrentPowerStatus` property. The existing
phosphor-chassis-state-manager, which is instantiated per instance of chassis in
the system, will support a read of this property.
The following will be the
possible returned values for the power status of the target chassis:
- `Undefined`
- `BrownOut`
- `UninterruptiblePowerSupply`
- `Good`

The phosphor-psu-monitor application within the phosphor-power repository will
be responsible for monitoring for brownout conditions. It will support a
per-chassis interface which represents the status of the power going into
the target chassis. This interface will be generic in that other applications
could host it to report the status of the power. The state-manager software
will utilize mapper to look for all implementations of the interface for its
chassis and aggregate the status (i.e. if any reports a brownout, then
`BrownOut` will be returned). This interface will be defined in a later update
to this document.

The application(s) responsible for detecting and reporting chassis power will
run on startup and discover the correct state for their property. These
applications will log an error when a brownout occurs and initiate the fast
power off.

If the system design needs it, the existing one-time function provided by
phosphor-state-manager for auto power on policy will be utilized for when
the brownout completes.

When the phosphor-power application detects that a brownout condition has
completed it will reset its interface representing power status to good and
start the state-manager service which executes the automated power-on logic.

phosphor-state-manager will ensure automated power-on recovery logic is only run
when the power supply interface reports the power status is good. If there are
multiple chassis and/or host instances in the system then the host instances
associated with the chassis with a bad power status will be the only ones
prevented from booting.

### Uninterruptible Power Supply (UPS)
A new phosphor-dbus-interfaces interface will be defined to represent a UPS.
A BMC
application will implement one of these per UPS attached to the system.
This application will monitor the UPS status for the following conditions:
- UPS utility fail (system power has failed and UPS is providing system power)
- UPS battery low (UPS is about to run out of power)

If the application sees power has been lost and the system is running on
UPS battery power then it will monitor the power remaining in the UPS and
notify the host that a shutdown is required if needed. This application
will also be responsible for logging an error indicating that the system has
switched to UPS backup power, and for setting the appropriate property in its
interface to indicate the scenario is present when the system can no longer
remain on.
phosphor-state-manager will query mapper for implementations of this new UPS
interface and utilize them in combination with power supply brownout status
when determining the value to return for its `CurrentPowerStatus`.

Similar to the above brownout scenario, phosphor-state-manager will ensure
automated power-on recovery logic is not run if `CurrentPowerStatus` is not set
to `Good`. This behavior will be build-time configurable within
phosphor-state-manager.

## Alternatives Considered
None, this is a pretty basic feature that does not have a lot of alternatives
(other than just not doing it).

## Impacts
None

## Testing
The control of this policy can already be set via the Redfish API.
```
# Power Restore Policy
curl -k -X PATCH -d '{"PowerRestorePolicy":"AlwaysOn"}' https://${bmc}/redfish/v1/Systems/system
curl -k -X PATCH -d '{"PowerRestorePolicy":"AlwaysOff"}' https://${bmc}/redfish/v1/Systems/system
curl -k -X PATCH -d '{"PowerRestorePolicy":"LastState"}' https://${bmc}/redfish/v1/Systems/system
```
For testing, each policy should be set and verified.
The one_time aspect should
also be checked for each possible value and verified to only be used once.

Validate that when multiple blackouts occur, the firmware continues to try
and power on the system when policy is `AlwaysOn` or `Restore`.

On supported systems, a pin hole reset should be done with a system that has
a policy set to always power on. The tester should verify the system does not
automatically power on after a pin hole reset, and that it does automatically
power on when a normal reboot of the BMC is done.

A brownout condition should be injected into a system and appropriate paths
should be verified:
- Error log generated
- Host notified (if running and notification possible)
- System quickly powered off
- Power recovery function is not run while a brownout is present
- System automatically powers back on when brownout condition ends (assuming a
  one-time or system auto power-on recovery policy of `AlwaysOn` or `Restore`)

Plug a UPS into a system and ensure when power is cut to the system that an
error is logged and the host is notified and allowed to power off.

[pdi-restore]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Control/Power/RestorePolicy.interface.yaml
[state-mgr]: https://github.com/openbmc/phosphor-state-manager
[bmc-state]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/State/BMC.interface.yaml
[dev-tree]: https://github.com/openbmc/docs/blob/master/designs/device-tree-gpio-naming.md