# OpenBMC Server Power Recovery

Author: Andrew Geissler (geissonator)

Other contributors:

Created: October 11th, 2021

## Problem Description

Modern computer systems have a feature, automated power-on recovery, which in
essence is the ability to tell your system what to do when it hits issues with
power to the system. If the system had a blackout (i.e. power was completely
cut to the system), should it automatically power the system on? Should it leave
it off? Or maybe the user would like the system to go to whichever state it was
in before the power loss.

There are also instances where the user may not want automatic power recovery to
occur. For example, some systems have op-panels, and on these op-panels there
can be a pin hole reset. This is a manual mechanism for the user to force a hard
reset to the BMC in situations where it is hung or not responding. In these
situations, the user may not want the system to power on automatically, because
they want to debug the reason for the BMC error.

During blackout scenarios, system owners may have a set of services they need
run once the power is restored. For example, IBM requires all LEDs be toggled
to off in a blackout. OpenBMC needs to provide a mechanism for system owners to
run services in this scenario.

A brownout is another scenario that commonly utilizes automated power-on
recovery features. A brownout is a scenario where BMC firmware detects (or is
told) that chassis power can no longer be supported, but power to the BMC will
be retained. On some systems, it's desired to utilize the automated power-on
feature to turn chassis power back on as soon as the brownout condition ends.

Some system owners may choose to attach an Uninterruptible Power Supply (UPS) to
their system. A UPS continues to provide power to a system through a blackout or
brownout scenario.
A UPS has a limited amount of power so its main purpose is
to handle brief power interruptions or to allow for an orderly shutdown of the
host firmware.

The goal of this design document is to describe how OpenBMC firmware will deal
with these questions.

## Background and References

The BMC already implements a limited subset of function in this area. The
[PowerRestorePolicy][pdi-restore] property out in phosphor-dbus-interfaces
defines the function capability.

In smaller servers, this feature is commonly found within the Advanced
Configuration and Power Interface (ACPI).

[openbmc/phosphor-state-manager][state-mgr] supports this property as defined in
phosphor-dbus-interfaces.

## Requirements

### Automated Power-On Recovery

OpenBMC software must ensure it persists the state of power to the chassis so it
can know what to restore it to if necessary.

OpenBMC software must provide support for the following options:

- Do nothing when power is lost to the system (this will be the default)
- Always power the system on and boot the host
- Always power the system off (previous power was on, power is now off, run all
  chassis power off services to ensure a clean state of software and hardware)
- Restore the previous state of the chassis power and host

These options are only checked and enforced in situations where the BMC does not
detect that chassis power is already on to the system when it comes out of
reboot.

OpenBMC software must also support the concept of a one_time power restore
policy. This is a separate instance of the `PowerRestorePolicy` which will be
hosted under a D-Bus object path which ends with "one_time". If this one_time
setting is not the default, `None`, then software will execute the policy
defined under it, and then reset the one_time property to `None`.
This one_time
feature is a way for software to utilize automated power-on recovery function
for other areas like firmware update scenarios where a certain power on behavior
is desired once an update has completed.

### BMC and System Recovery Paths

In situations where the BMC or the system have gotten into a bad state, and the
user has initiated some form of manual reset which is detectable by the BMC as
being user initiated, the BMC software must:

- Fill in appropriate `RebootCause` within the [BMC state interface][bmc-state]
  - At a minimum, `PinholeReset` will be added. Others can be added as needed
- Log an error indicating a user initiated forced reset has occurred
- Not log an error indicating a blackout has occurred if chassis power was on
  prior to the pin hole reset
- Not implement any power recovery policy on the system
- Turn power recovery back on once BMC has a normal reboot

### Blackout

A blackout occurs when AC power is cut from the system, resulting in a total
loss of power if there is no UPS installed to keep the system on. To identify
this scenario after a BMC reboot, chassis-state-manager will check what the
last power state was before the loss of power and compare it against the
pgood pin. Blackouts can be intentionally triggered by a user (i.e. a pinhole
reset) or in severe cases occur when there is some sort of an external outage.
In either case the BMC must take into account this detrimental state. When this
condition occurs, the BMC may (depending on configuration):

- Provide a generic target, `obmc-chassis-blackout@.target` to be called when a
  blackout is detected
- Adhere to the current power restore policy

BMC firmware must also be able to:

- Discover why the system is in a blackout situation: either loss of power
  or user actions.
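
The blackout check described above (persisted power state compared against the
pgood pin, plus discovery of the cause) could be modeled roughly as follows.
This is only an illustrative sketch: `BlackoutCause`, `classify_blackout`, and
the boolean inputs are hypothetical names chosen for this example and are not
part of phosphor-state-manager.

```python
from enum import Enum


class BlackoutCause(Enum):
    """Hypothetical classification of a detected blackout."""

    NONE = "none"                # no blackout detected
    POWER_LOSS = "power_loss"    # external loss of AC power
    USER_ACTION = "user_action"  # user initiated, e.g. a pinhole reset


def classify_blackout(
    last_power_on: bool, pgood: bool, user_reset: bool
) -> BlackoutCause:
    """Compare the persisted chassis power state against the pgood pin.

    last_power_on: persisted chassis power state before the BMC reboot
    pgood:         current value of the pgood pin (True means power is good)
    user_reset:    True if the reboot was detected as user initiated
    """
    # Chassis was on before the reboot but pgood is now low: power was lost.
    if last_power_on and not pgood:
        if user_reset:
            return BlackoutCause.USER_ACTION
        return BlackoutCause.POWER_LOSS
    # Either the chassis was already off or power is still good.
    return BlackoutCause.NONE
```

A caller could use the returned cause to decide whether to start the blackout
target and whether a blackout error should be logged.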

### Brownout

As noted above, a brownout condition is when AC power cannot continue to be
supplied to the chassis, but the BMC can continue to have power and run.

When this condition occurs, the BMC must:

- Power system off as quickly as the situation requires (or gracefully handle
  the loss of power if it occurred without warning)
- Log an error indicating the brownout event has occurred
- Support the ability for host firmware to indicate a one-time power restore
  policy, if it wishes, for when the brownout completes
- Identify when a brownout condition has completed
- Wait for the brownout to complete and implement the one-time power restore
  policy. If no one-time policy is defined then run the standard power restore
  policy defined for the system

BMC firmware must also be able to:

- Discover if system is in a brownout situation
  - Run when the BMC first comes up to know if it should implement any automated
    power-on recovery
- Not run any power-on recovery logic when a brownout is occurring
- Tell the host firmware that it is an automated power-on recovery initiated
  boot when that firmware is what boots the system

### Uninterruptible Power Supply (UPS)

When a UPS is present and a blackout or brownout condition occurs, the BMC must:

- Log an error to indicate the condition has occurred
- If host firmware is running, notify the host firmware of this utility failure
  condition (this behavior is build-time configurable)
- If the UPS battery power becomes low and if host firmware is running, notify
  the host firmware of the condition, indicating a quick power off is required
  (this behavior is build-time configurable)
- Log an error if the UPS battery power becomes low and a power loss to the
  entire system is imminent (i.e.
  a blackout scenario where the BMC will also lose
  power and the UPS is about to run out of power)
- Not execute any automated power-on recovery logic to prevent power on/off
  thrashing (this behavior is build-time configurable)

## Proposed Design

### Automated Power-On Recovery

An application will be run after the chassis and host states have been
determined which will only run if the chassis power is not on.

This application will look for the one_time setting and use it if its value is
not `None`. If it does use the one_time setting then it will reset it to `None`
once it has read it. Otherwise the application will read the persistent value of
the `PowerRestorePolicy`. The application will then run the logic as defined in
the Requirements above.

This function will be hosted in phosphor-state-manager and potentially
x86-power-control.

### BMC and System Recovery Paths

The BMC state manager application currently looks at a file in sysfs to try
and determine the cause of a BMC reboot. It then puts this reason in the
`RebootCause` property.

One possible cause of a BMC reset is an external reset (EXTRST). There are a
variety of reasons an external reset can occur. Some systems are adding GPIOs to
provide additional detail on these types of resets.

A new GPIO name will be added to the [device-tree-gpio-naming.md][dev-tree]
which reports whether a pin hole reset has occurred on the previous reboot of
the BMC. The BMC state manager application will enhance its support of the
`RebootCause` to look for this GPIO and if present, read it and set
`RebootCause` accordingly when it can either not determine the reason for the
reboot via sysfs or sysfs reports an EXTRST reason (in which case the GPIO
will be utilized to enhance the reboot reason).
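
As a rough illustration of the GPIO-assisted decision described above, the
logic might look like the sketch below. The function name
`resolve_reboot_cause`, the use of plain strings for causes, and the
`"Unknown"` fallback are assumptions made for this example; the real
`RebootCause` values are defined by the BMC state interface.

```python
from typing import Optional


def resolve_reboot_cause(
    sysfs_cause: Optional[str], pinhole_gpio: Optional[bool]
) -> str:
    """Combine the sysfs reboot reason with the pin hole reset GPIO.

    sysfs_cause:  reason read from sysfs, or None if it could not be determined
    pinhole_gpio: GPIO value (True means a pin hole reset occurred), or None
                  if the system does not provide the GPIO
    """
    # The GPIO is only consulted when sysfs gives no answer or reports EXTRST.
    if sysfs_cause in (None, "EXTRST") and pinhole_gpio:
        return "PinholeReset"
    # Otherwise keep whatever sysfs reported ("Unknown" is a placeholder here).
    return sysfs_cause if sysfs_cause is not None else "Unknown"
```

For example, `resolve_reboot_cause("EXTRST", True)` yields `"PinholeReset"`,
while a system without the GPIO (`pinhole_gpio=None`) keeps the sysfs reason.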

If the power recovery software sees the `PinholeReset` reason within the
`RebootCause` then it will not implement any of its policy. Future BMC reboots
that are not caused by a pin hole reset will cause `RebootCause` to go back to a
default and therefore power recovery policy will be re-enabled on that BMC boot.

The phosphor-state-manager chassis software will not log a blackout error if it
sees the `PinholeReset` reason (or any other reason that indicates a user
initiated a reset of the system).

### Blackout

A new systemd target `obmc-chassis-blackout.target` should be added to allow
system maintainers to call services in this condition. This new target will be
called when the BMC detects a blackout. The target will allow for system owners
to add their own specific services to this new target.
Phosphor-chassis-state-manager will ensure `obmc-chassis-blackout.target` will
be called after a blackout.

### Brownout

The existing `xyz.openbmc_project.State.Chassis` interface will be enhanced to
support a `CurrentPowerStatus` property. The existing
phosphor-chassis-state-manager, which is instantiated per instance of chassis in
the system, will support a read of this property. The following will be the
possible returned values for the power status of the target chassis:

- `Undefined`
- `BrownOut`
- `UninterruptiblePowerSupply`
- `Good`

The phosphor-psu-monitor application within the phosphor-power repository will
be responsible for monitoring for brownout conditions. It will support a
per-chassis interface which represents the status of the power going into the
target chassis. This interface will be generic in that other applications could
host it to report the status of the power. The state-manager software will
utilize mapper to look for all implementations of the interface for its chassis
and aggregate the status (i.e.
if any reports a brownout, then `BrownOut` will
be returned). This interface will be defined in a later update to this document.

The application(s) responsible for detecting and reporting chassis power will
run on startup and discover the correct state for their property. These
applications will log an error when a brownout occurs and initiate the fast
power off.

If the system design needs it, the existing one-time function provided by
phosphor-state-manager for auto power on policy will be utilized for when the
brownout completes.

When the phosphor-power application detects that a brownout condition has
completed it will reset its interface representing power status to good and
start the state-manager service which executes the automated power-on logic.

phosphor-state-manager will ensure automated power-on recovery logic is only run
when the power supply interface reports the power status is good. If there are
multiple chassis and/or host instances in the system then the host instances
associated with the chassis that have a bad power status will be the only ones
prevented from booting.

### Uninterruptible Power Supply (UPS)

A new phosphor-dbus-interface will be defined to represent a UPS. A BMC
application will implement one of these per UPS attached to the system. This
application will monitor UPS status and monitor for the following:

- UPS utility fail (system power has failed and UPS is providing system power)
- UPS battery low (UPS is about to run out of power)

If the application sees power has been lost and the system is running on UPS
battery power then it will monitor for the power remaining in the UPS and notify
the host that a shutdown is required if needed.
This application will also be
responsible for logging an error indicating that the system has switched to UPS
backup power, and for setting the appropriate property in its interface to
indicate when the system can no longer remain on.
phosphor-state-manager will query mapper for implementations of this new UPS
interface and utilize them in combination with power supply brownout status when
determining the value to return for its `CurrentPowerStatus`.

Similar to the above brownout scenario, phosphor-state-manager will ensure
automated power-on recovery logic is not run if `CurrentPowerStatus` is not set
to `Good`. This behavior will be build-time configurable within
phosphor-state-manager.

## Alternatives Considered

None, this is a pretty basic feature that does not have a lot of alternatives
(other than just not doing it).

## Impacts

None

## Testing

The control of this policy can already be set via the Redfish API.

```
# Power Restore Policy
curl -k -H "Content-Type: application/json" -X PATCH -d '{"PowerRestorePolicy":"AlwaysOn"}' https://${bmc}/redfish/v1/Systems/system
curl -k -H "Content-Type: application/json" -X PATCH -d '{"PowerRestorePolicy":"AlwaysOff"}' https://${bmc}/redfish/v1/Systems/system
curl -k -H "Content-Type: application/json" -X PATCH -d '{"PowerRestorePolicy":"LastState"}' https://${bmc}/redfish/v1/Systems/system
```

For testing, each policy should be set and verified. The one_time aspect should
also be checked for each possible value and verified to only be used once.

Validate that when multiple blackouts occur, the firmware continues to try and
power on the system when policy is `AlwaysOn` or `Restore`.

On supported systems, a pin hole reset should be done with a system that has a
policy set to always power on. The tester should verify the system does not
automatically power on after a pin hole reset.
Verify it does automatically
power on when a normal reboot of the BMC is done.

A brownout condition should be injected into a system and appropriate paths
should be verified:

- Error log generated
- Host notified (if running and notification possible)
- System quickly powered off
- Power recovery function is not run while a brownout is present
- System automatically powers back on when brownout condition ends (assuming a
  one-time or system auto power-on recovery policy of `AlwaysOn` or `Restore`)

Plug a UPS into a system and ensure when power is cut to the system that an
error is logged and the host is notified and allowed to power off.

[pdi-restore]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Control/Power/RestorePolicy.interface.yaml
[state-mgr]: https://github.com/openbmc/phosphor-state-manager
[bmc-state]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/State/BMC.interface.yaml
[dev-tree]:
  https://github.com/openbmc/docs/blob/master/designs/device-tree-gpio-naming.md
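
When writing the one_time test cases, it may help to have a small model of the
expected behavior to compare against. The sketch below is one possible reading
of the design, not the phosphor-state-manager implementation: the
`RestoreSettings` class and both function names are hypothetical, and the
policy strings are assumed to mirror the `None`, `AlwaysOn`, `AlwaysOff`, and
`Restore` policy names.

```python
from dataclasses import dataclass


@dataclass
class RestoreSettings:
    """Hypothetical stand-in for the two PowerRestorePolicy D-Bus instances."""

    persistent: str = "None"  # standing policy
    one_time: str = "None"    # one_time policy, used at most once


def select_policy(settings: RestoreSettings) -> str:
    """Return the policy to run, consuming the one_time value if it is set."""
    if settings.one_time != "None":
        policy = settings.one_time
        settings.one_time = "None"  # reset once it has been read
        return policy
    return settings.persistent


def should_power_on(policy: str, previously_on: bool) -> bool:
    """Decide whether to power on, given that chassis power is currently off."""
    if policy == "AlwaysOn":
        return True
    if policy == "Restore":
        return previously_on
    # "None" and "AlwaysOff" both leave the system off.
    return False
```

With `persistent="AlwaysOff"` and `one_time="AlwaysOn"`, the first call to
`select_policy` returns `AlwaysOn` and clears the one_time value, so a second
call falls back to `AlwaysOff`.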