____
# OpenBMC Server Power Recovery

Author: Andrew Geissler (geissonator)

Primary assignee: Andrew Geissler (geissonator)

Other contributors:

Created: October 11th, 2021

## Problem Description
Modern computer systems support automated power-on recovery: the ability to
tell the system what to do after it loses power. If the system had a blackout
(i.e. power was completely cut to the system), should it automatically power
back on? Should it stay off? Or should it return to whichever state it was in
before the power loss?

There are also cases where the user may not want automatic power recovery to
occur. For example, some systems have op-panels with a pin hole reset, a manual
mechanism that lets the user force a hard reset of the BMC when it is hung or
not responding. In these situations the user may not want the system to power
on automatically, because they want to debug the cause of the BMC error.

The goal of this design document is to describe how OpenBMC firmware will
deal with these questions.

## Background and References
The BMC already implements a limited subset of function in this area.
The [PowerRestorePolicy][pdi-restore] property in phosphor-dbus-interfaces
defines this capability.

In smaller servers, this feature is commonly found within the Advanced
Configuration and Power Interface (ACPI).

[openbmc/phosphor-state-manager][state-mgr] supports this property as defined
in phosphor-dbus-interfaces.

Future updates to this document will touch on more complex scenarios like
brownouts (chassis power loss while the BMC remains on), handling of external
uninterruptible power supplies (UPS), and enhanced tracking of the different
types of errors that can occur in this area.

## Requirements

### Automated Power-On Recovery
OpenBMC software must persist the state of chassis power so that it knows
what to restore it to if necessary.

OpenBMC software must provide support for the following options:
- Do nothing when power is lost to the system (this will be the
  default)
- Always power the system on and boot the host
- Always power the system off (previous power was on, power is now off, run
  all chassis power off services to ensure a clean state of software and
  hardware)
- Restore the previous state of the chassis power and host

These options are only checked and enforced when the BMC comes out of a reboot
and does not detect that chassis power is already on.

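The options above can be sketched as a small decision function. This is a minimal illustration in Python, not the phosphor-state-manager implementation. The `Policy` names follow the D-Bus policy values, while `Action`, `recovery_action`, and the `chassis_power_on` / `previous_power_on` parameters are hypothetical names introduced here. It also assumes that `Restore` has nothing to do when the previous state was off.

```python
from enum import Enum


class Policy(Enum):
    NONE = "None"              # do nothing (the default in this design)
    ALWAYS_ON = "AlwaysOn"     # always power on and boot the host
    ALWAYS_OFF = "AlwaysOff"   # always run the chassis power-off services
    RESTORE = "Restore"        # return to the state before the power loss


class Action(Enum):            # hypothetical names for this sketch
    NOTHING = "nothing"
    POWER_ON = "power-on"
    POWER_OFF = "power-off"


def recovery_action(policy, chassis_power_on, previous_power_on):
    """Pick the recovery action for one BMC boot."""
    # The policy is only enforced when chassis power is not already on.
    if chassis_power_on:
        return Action.NOTHING
    if policy is Policy.ALWAYS_ON:
        return Action.POWER_ON
    if policy is Policy.ALWAYS_OFF:
        return Action.POWER_OFF
    if policy is Policy.RESTORE and previous_power_on:
        return Action.POWER_ON
    # Policy.NONE, or Restore with a previous state of off (assumption:
    # power is already off, so there is nothing to restore).
    return Action.NOTHING
```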
OpenBMC software must also support the concept of a one_time power restore
policy. This is a separate instance of the `PowerRestorePolicy` hosted under a
D-Bus object path that ends with "one_time". If this one_time setting is not
the default, `None`, then software will execute the policy defined under it
and then reset the one_time property to `None`. The one_time feature lets
software use the automated power-on recovery function in other areas, such as
firmware update scenarios where a certain power-on behavior is desired once an
update has completed.

### BMC and System Recovery Paths
In situations where the BMC or the system has gotten into a bad state, and
the user has initiated some form of manual reset that the BMC can detect as
user initiated, the BMC software must:
- Fill in the appropriate `RebootCause` within the
  [BMC state interface][bmc-state]
  - At a minimum, `PinholeReset` will be added. Others can be added as needed
- Log an error indicating a user initiated forced reset has occurred
- Not log an error indicating a blackout has occurred if chassis power was on
  prior to the pin hole reset
- Not implement any power recovery policy on the system
- Turn power recovery back on once the BMC has a normal reboot

## Proposed Design

### Automated Power-On Recovery
An application will run after the chassis and host states have been
determined, and only if the chassis power is not on.

This application will look for the one_time setting and use it if its value
is not `None`. If it does use the one_time setting, it will reset it to `None`
once it has read it. Otherwise the application will read the persistent value
of the `PowerRestorePolicy`. The application will then run the logic as
defined in the Requirements above.

This function will be hosted in phosphor-state-manager and potentially
x86-power-control.

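The one_time handling described in this section can be sketched as follows. This is an illustrative Python sketch under stated assumptions: a plain dictionary stands in for the two D-Bus settings objects, and the keys `one_time` and `persistent` and the function name `select_policy` are hypothetical.

```python
NONE = "None"


def select_policy(settings):
    """Return the policy to enforce on this boot, consuming one_time if set."""
    one_time = settings.get("one_time", NONE)
    if one_time != NONE:
        # The one_time value wins and is reset once it has been read,
        # so it only applies to a single boot.
        settings["one_time"] = NONE
        return one_time
    return settings.get("persistent", NONE)


# Hypothetical example: a firmware update requested one power on.
settings = {"one_time": "AlwaysOn", "persistent": "AlwaysOff"}
first = select_policy(settings)    # uses the one_time value
second = select_policy(settings)   # falls back to the persistent value
```

After the two calls, `first` is `"AlwaysOn"`, `second` is `"AlwaysOff"`, and the stored one_time value is back to `"None"`.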
### BMC and System Recovery Paths
The BMC state manager application currently looks at a file in sysfs to try
to determine the cause of a BMC reboot. It then puts this reason in the
`RebootCause` property.

One possible cause of a BMC reset is an external reset (EXTRST). There are
a variety of reasons an external reset can occur. Some systems are adding
GPIOs to provide additional detail on these types of resets.

A new GPIO name will be added to
[device-tree-gpio-naming.md][dev-tree] which reports whether a pin hole reset
occurred on the previous reboot of the BMC. The BMC state manager application
will enhance its `RebootCause` support to look for this GPIO. If the GPIO is
present, the application will read it and set `RebootCause` accordingly in two
cases: when it cannot determine the reason for the reboot via sysfs, or when
sysfs reports an EXTRST reason (in which case the GPIO is used to refine the
reboot reason).

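A rough sketch of how the sysfs reason and the optional GPIO could be combined is shown below. It is written in Python for illustration only. The function name, the string values other than `PinholeReset`, and the `"Unknown"` fallback are assumptions, and reading sysfs or the GPIO itself is left out.

```python
def reboot_cause(sysfs_cause, pinhole_gpio):
    """Combine the sysfs reset reason with the optional pin hole GPIO.

    sysfs_cause:  reason decoded from sysfs, or None if it could not be
                  determined (the string values are hypothetical).
    pinhole_gpio: None when the GPIO is not present on this system,
                  otherwise True/False for "a pin hole reset occurred".
    """
    # The GPIO is only consulted when sysfs gave no answer or reported
    # an external reset.
    gpio_applies = sysfs_cause is None or sysfs_cause == "EXTRST"
    if gpio_applies and pinhole_gpio:
        return "PinholeReset"
    # Otherwise keep whatever sysfs reported (assumed fallback: "Unknown").
    return sysfs_cause if sysfs_cause is not None else "Unknown"
```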
If the power recovery software sees the `PinholeReset` reason within the
`RebootCause`, it will not implement any of its policy. A future BMC reboot
that is not caused by a pin hole reset will return `RebootCause` to a default
value, so the power recovery policy will be re-enabled on that BMC boot.

The phosphor-state-manager chassis software will not log a blackout error
if it sees the `PinholeReset` reason (or any other reason that indicates a user
initiated a reset of the system).

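The two checks in this section can be sketched as simple predicates. This is an illustrative Python sketch, not the phosphor-state-manager code. The function names, parameters, and the `USER_INITIATED_CAUSES` set are hypothetical.

```python
# Reboot causes that indicate a user initiated reset. Only PinholeReset is
# named in this design; others can be added as needed.
USER_INITIATED_CAUSES = {"PinholeReset"}


def should_run_power_recovery(reboot_cause, chassis_power_on):
    """Policy runs only if chassis power is off and the reset was not manual."""
    return not chassis_power_on and reboot_cause not in USER_INITIATED_CAUSES


def should_log_blackout(reboot_cause, chassis_was_on, chassis_power_on):
    """A blackout is logged only for an unexpected loss of chassis power."""
    lost_power = chassis_was_on and not chassis_power_on
    return lost_power and reboot_cause not in USER_INITIATED_CAUSES
```

Because the cause is re-evaluated on every BMC boot, a later normal reboot makes `should_run_power_recovery` return `True` again without any extra state.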
## Alternatives Considered
None. This is a basic feature that does not have many alternatives
(other than not doing it).

## Impacts
None

## Testing
This policy can already be set via the Redfish API.
```
#  Power Restore Policy
#  Authentication options (for example -u or a session token) are omitted here.
curl -k -X PATCH -d '{"PowerRestorePolicy":"AlwaysOn"}' https://${bmc}/redfish/v1/Systems/system
curl -k -X PATCH -d '{"PowerRestorePolicy":"AlwaysOff"}' https://${bmc}/redfish/v1/Systems/system
curl -k -X PATCH -d '{"PowerRestorePolicy":"LastState"}' https://${bmc}/redfish/v1/Systems/system
```
For testing, each policy should be set and verified. The one_time aspect should
also be checked for each possible value and verified to only be used once.

Validate that when multiple blackouts occur, the firmware continues to try
to power on the system when the policy is `AlwaysOn` or `Restore` (`LastState`
in Redfish).

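When scripting these checks, a small helper can keep the Redfish and D-Bus names straight. The sketch below is in Python and assumes the Redfish `LastState` value corresponds to the D-Bus `Restore` policy. The helper names `patch_body` and `expected_dbus_policy` are hypothetical, and sending the request is left to the tester's own tooling.

```python
import json

# Redfish PowerRestorePolicy value -> D-Bus policy name (last component).
REDFISH_TO_DBUS = {
    "AlwaysOn": "AlwaysOn",
    "AlwaysOff": "AlwaysOff",
    "LastState": "Restore",
}


def patch_body(redfish_policy):
    """Build the JSON body used by the PATCH requests shown above."""
    if redfish_policy not in REDFISH_TO_DBUS:
        raise ValueError(f"unsupported Redfish policy: {redfish_policy}")
    return json.dumps({"PowerRestorePolicy": redfish_policy})


def expected_dbus_policy(redfish_policy):
    """D-Bus policy a tester would expect to read back after the PATCH."""
    return REDFISH_TO_DBUS[redfish_policy]
```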
On supported systems, a pin hole reset should be done on a system that has
its policy set to always power on. The tester should verify that the system
does not automatically power on after a pin hole reset, and that it does
automatically power on after a normal reboot of the BMC.

[pdi-restore]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Control/Power/RestorePolicy.interface.yaml
[state-mgr]: https://github.com/openbmc/phosphor-state-manager
[bmc-state]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/State/BMC.interface.yaml
[dev-tree]: https://github.com/openbmc/docs/blob/master/designs/device-tree-gpio-naming.md