# OpenBMC Server Power Recovery

Author: Andrew Geissler (geissonator)

Primary assignee: Andrew Geissler (geissonator)

Other contributors:

Created: October 11th, 2021

## Problem Description
Modern computer systems have a feature, automated power-on recovery, which
in essence is the ability to tell your system what to do when it hits
issues with power to the system. If the system had a blackout (i.e. power
was completely cut to the system), should it automatically power the system
on? Should it leave it off? Or maybe the user would like the system to
go to whichever state it was in before the power loss.

There are also instances where the user may not want automatic power recovery
to occur. For example, some systems have op-panels, and on these op-panels
there can be a pin hole reset. This is a manual mechanism for the user to
force a hard reset to the BMC in situations where it is hung or not responding.
In these situations, the user may not want the system to power on
automatically, because they want to debug the reason for the BMC error.

A brownout is another scenario that commonly utilizes automated power-on
recovery features. A brownout is a scenario where BMC firmware detects (or is
told) that chassis power can no longer be supported, but power to the BMC
will be retained. On some systems, it's desired to utilize the automated
power-on feature to turn chassis power back on as soon as the brownout condition
ends.

The goal of this design document is to describe how OpenBMC firmware will
deal with these questions.

## Background and References
The BMC already implements a limited subset of function in this area.
The [PowerRestorePolicy][pdi-restore] property in phosphor-dbus-interfaces
defines the function capability.

In smaller servers, this feature is commonly found within the Advanced
Configuration and Power Interface (ACPI).

[openbmc/phosphor-state-manager][state-mgr] supports this property as defined
in phosphor-dbus-interfaces.

## Requirements

### Automated Power-On Recovery
OpenBMC software must ensure it persists the state of power to the chassis so
it can know what to restore it to if necessary.

OpenBMC software must provide support for the following options:
- Do nothing when power is lost to the system (this will be the
  default)
- Always power the system on and boot the host
- Always power the system off (previous power was on, power is now off, run
  all chassis power off services to ensure a clean state of software and
  hardware)
- Restore the previous state of the chassis power and host

These options are only checked and enforced in situations where the BMC does
not detect that chassis power is already on to the system when it comes out
of reboot.
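
The four options and the "chassis power is already on" guard can be sketched
as a small decision function. This is only an illustration in Python, not the
OpenBMC implementation: the `Action` labels and the `decide_action` helper are
hypothetical, and treating `Restore` with a previously-off chassis as "do
nothing" is an assumption.

```python
from enum import Enum


class Policy(Enum):
    """Policy names as used in this document."""
    NONE = "None"
    ALWAYS_ON = "AlwaysOn"
    ALWAYS_OFF = "AlwaysOff"
    RESTORE = "Restore"


class Action(Enum):
    """Hypothetical action labels, not real D-Bus transition names."""
    DO_NOTHING = "do_nothing"
    POWER_ON = "power_on"
    POWER_OFF = "power_off"


def decide_action(policy, chassis_power_is_on, previous_power_was_on):
    """Pick the recovery action for one chassis after a BMC reboot."""
    # Policies are only enforced when the BMC finds chassis power off.
    if chassis_power_is_on:
        return Action.DO_NOTHING
    if policy is Policy.ALWAYS_ON:
        return Action.POWER_ON
    if policy is Policy.ALWAYS_OFF:
        # Runs the power off services to reach a clean state.
        return Action.POWER_OFF
    if policy is Policy.RESTORE and previous_power_was_on:
        return Action.POWER_ON
    # Policy.NONE, or Restore when power was already off (assumption).
    return Action.DO_NOTHING
```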

OpenBMC software must also support the concept of a one_time power restore
policy. This is a separate instance of the `PowerRestorePolicy` which will
be hosted under a D-Bus object path which ends with "one_time". If this
one_time setting is not the default, `None`, then software will execute
the policy defined under it, and then reset the one_time property to `None`.
This one_time feature lets other software reuse the automated power-on
recovery function, for example in firmware update scenarios where a certain
power-on behavior is desired once an update has completed.

### BMC and System Recovery Paths
In situations where the BMC or the system has gotten into a bad state, and
the user has initiated some form of manual reset which is detectable by the
BMC as being user initiated, the BMC software must:
- Fill in the appropriate `RebootCause` within the
  [BMC state interface][bmc-state]
  - At a minimum, `PinholeReset` will be added. Others can be added as needed
- Log an error indicating a user initiated forced reset has occurred
- Not log an error indicating a blackout has occurred if chassis power was on
  prior to the pin hole reset
- Not implement any power recovery policy on the system
- Turn power recovery back on once the BMC has a normal reboot

### Brownout
As noted above, a brownout condition is when AC power cannot continue to be
supplied to the chassis, but the BMC can continue to have power and run.

When this condition occurs, the BMC must:
- Power the system off as quickly as the situation requires (or gracefully
  handle the loss of power if it occurred without warning)
- Log an error indicating the brownout event has occurred
- Support the ability for host firmware to indicate a one-time power restore
  policy, if it wishes, to be applied when the brownout completes
- Identify when a brownout condition has completed
- Wait for the brownout to complete and implement the one-time power restore
  policy. If no one-time policy is defined then run the standard power restore
  policy defined for the system

BMC firmware must also be able to:
- Discover if the system is in a brownout situation
  - Run when the BMC first comes up to know if it should implement any automated
    power-on recovery
- Not run any power-on recovery logic when a brownout is occurring
- Tell the host firmware that it is an automated power-on recovery initiated
  boot when that firmware is what boots the system

## Proposed Design

### Automated Power-On Recovery
An application will be run after the chassis and host states have been
determined. It will only run if the chassis power is not on.

This application will look for the one_time setting and use it if its value
is not `None`. If it does use the one_time setting then it will reset it
to `None` once it has read it. Otherwise the application will read the
persistent value of the `PowerRestorePolicy`. The application will then
run the logic as defined in the Requirements above.
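
The lookup order described above can be sketched as follows. This is a minimal
illustration in Python, not the phosphor-state-manager code: the `settings`
dictionary and its `one_time` and `persistent` keys are hypothetical stand-ins
for the two D-Bus `PowerRestorePolicy` instances.

```python
NONE = "None"  # default value of the one_time setting, per this document


def resolve_policy(settings):
    """Return the policy to apply, consuming the one_time override if set.

    `settings` is a hypothetical mutable mapping standing in for the two
    D-Bus PowerRestorePolicy instances.
    """
    one_time = settings.get("one_time", NONE)
    if one_time != NONE:
        # The override is used exactly once, so reset it after reading it.
        settings["one_time"] = NONE
        return one_time
    # No override: fall back to the persistent policy.
    return settings.get("persistent", NONE)


if __name__ == "__main__":
    settings = {"one_time": "AlwaysOn", "persistent": "AlwaysOff"}
    print(resolve_policy(settings))  # first boot uses the override
    print(resolve_policy(settings))  # later boots use the persistent policy
```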

This function will be hosted in phosphor-state-manager and potentially
x86-power-control.

### BMC and System Recovery Paths
The BMC state manager application currently looks at a file in sysfs to try
to determine the cause of a BMC reboot. It then puts this reason in the
`RebootCause` property.

One possible cause of a BMC reset is an external reset (EXTRST). There are
a variety of reasons an external reset can occur. Some systems are adding
GPIOs to provide additional detail on these types of resets.

A new GPIO name will be added to the [device-tree-gpio-naming.md][dev-tree]
which reports whether a pin hole reset has occurred on the previous reboot of
the BMC. The BMC state manager application will enhance its support of the
`RebootCause` to look for this GPIO. If the GPIO is present, the application
will read it and set `RebootCause` accordingly in two cases: when it cannot
determine the reason for the reboot via sysfs, or when sysfs reports an EXTRST
reason (in which case the GPIO is used to refine the reboot reason).
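
A minimal sketch of that decision, in Python for illustration only. The
function name, the use of plain strings for causes, and the `"Unknown"`
fallback are assumptions made for this example; reading sysfs and the GPIO is
left out, so both values are passed in.

```python
from typing import Optional


def determine_reboot_cause(sysfs_cause: Optional[str],
                           pinhole_gpio: Optional[int]) -> str:
    """Combine the sysfs reboot reason with the pin hole reset GPIO.

    sysfs_cause:  reason derived from sysfs, or None if undetermined.
    pinhole_gpio: 1 if the GPIO reports a pin hole reset, 0 if it does
                  not, None if the GPIO is not present on this system.
    """
    # A definite, non-EXTRST reason from sysfs is used as-is.
    if sysfs_cause is not None and sysfs_cause != "EXTRST":
        return sysfs_cause
    # Reason is unknown or EXTRST: let the GPIO refine it if it is present.
    if pinhole_gpio == 1:
        return "PinholeReset"
    # Fallback value is a placeholder chosen for this sketch.
    return sysfs_cause if sysfs_cause is not None else "Unknown"
```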

If the power recovery software sees the `PinholeReset` reason within the
`RebootCause` then it will not implement any of its policy. Future BMC
reboots which are not caused by a pin hole reset will cause `RebootCause` to go
back to a default, and therefore power recovery policy will be reenabled on that
BMC boot.

The phosphor-state-manager chassis software will not log a blackout error
if it sees the `PinholeReset` reason (or any other reason that indicates a user
initiated a reset of the system).
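
These two checks can be sketched as simple predicates. The sketch below is
illustrative Python with hypothetical names; the set of user initiated causes
contains only `PinholeReset` because that is the only one this document
defines, and the blackout condition (chassis power was on before the reboot)
is a simplification.

```python
# Reboot causes that indicate a user initiated reset. Only PinholeReset
# is defined by this document; others could be added as needed.
USER_INITIATED_CAUSES = frozenset({"PinholeReset"})


def should_run_power_recovery(reboot_cause):
    """Power recovery policy is skipped after a user initiated reset."""
    return reboot_cause not in USER_INITIATED_CAUSES


def should_log_blackout(reboot_cause, chassis_power_was_on):
    """Log a blackout only if power was lost without a user reset."""
    return chassis_power_was_on and reboot_cause not in USER_INITIATED_CAUSES
```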

### Brownout
The existing `xyz.openbmc_project.State.Chassis` interface will be enhanced to
support a `CurrentPowerStatus` property. The existing
phosphor-chassis-state-manager, which is instantiated per instance of chassis in
the system, will support a read of this property. The following will be the
possible returned values for the power status of the target chassis:
- `Undefined`
- `BrownOut`
- `Good`

The phosphor-psu-monitor application within the phosphor-power repository will
be responsible for monitoring for brownout conditions. It will support a
per-chassis interface which represents the status of the power going into
the target chassis. This interface will be generic in that other applications
could host it to report the status of the power. The state-manager software
will utilize mapper to look for all implementations of the interface for its
chassis and aggregate the status (i.e. if any reports a brownout, then
`BrownOut` will be returned). This interface will be defined in a later update
to this document.
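
The aggregation rule can be sketched as below, in Python for illustration.
Only the "any brownout wins" rule comes from this document; returning
`Undefined` when no application reports a status, or when reports are a mix of
`Good` and `Undefined`, is an assumption of this sketch.

```python
def aggregate_power_status(reports):
    """Combine per-application power status reports for one chassis.

    `reports` is a list of status strings ("Undefined", "BrownOut",
    "Good"), one per application hosting the power status interface.
    """
    # Any single brownout report marks the whole chassis as BrownOut.
    if any(status == "BrownOut" for status in reports):
        return "BrownOut"
    # Good only when at least one report exists and all of them agree.
    if reports and all(status == "Good" for status in reports):
        return "Good"
    # No reports, or some report is Undefined (assumption of this sketch).
    return "Undefined"
```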

The application(s) responsible for detecting and reporting chassis power will
run on startup and discover the correct state for their property. These
applications will log an error when a brownout occurs and initiate the fast
power off.

If the system design needs it, the existing one-time function provided by
phosphor-state-manager for auto power-on policy will be utilized when
the brownout completes.

When the phosphor-power application detects that a brownout condition has
completed, it will reset its interface representing power status to good and
start the state-manager service which executes the automated power-on logic.

phosphor-state-manager will ensure automated power-on recovery logic is only run
when the power supply interface reports the power status is good. If there are
multiple chassis and/or host instances in the system then the host instances
associated with the chassis with a bad power status will be the only ones
prevented from booting.
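
For a multi-chassis system, the per-host gating can be sketched as follows.
This is illustrative Python with hypothetical names and data shapes; treating
any status other than `Good` (including a missing entry) as blocking is an
assumption of this sketch.

```python
def hosts_allowed_to_recover(host_to_chassis, chassis_status):
    """Return the hosts whose chassis currently reports good power.

    host_to_chassis: mapping of host instance name to chassis name.
    chassis_status:  mapping of chassis name to its power status string.
    """
    return sorted(
        host
        for host, chassis in host_to_chassis.items()
        # Anything other than "Good" blocks recovery (assumption).
        if chassis_status.get(chassis) == "Good"
    )


if __name__ == "__main__":
    hosts = {"host0": "chassis0", "host1": "chassis1"}
    status = {"chassis0": "Good", "chassis1": "BrownOut"}
    print(hosts_allowed_to_recover(hosts, status))  # only host0 is listed
```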

## Alternatives Considered
None, this is a pretty basic feature that does not have a lot of alternatives
(other than just not doing it).

## Impacts
None

## Testing
This policy can already be set via the Redfish API.
```
# Power Restore Policy
curl -k -X PATCH -d '{"PowerRestorePolicy":"AlwaysOn"}' https://${bmc}/redfish/v1/Systems/system
curl -k -X PATCH -d '{"PowerRestorePolicy":"AlwaysOff"}' https://${bmc}/redfish/v1/Systems/system
curl -k -X PATCH -d '{"PowerRestorePolicy":"LastState"}' https://${bmc}/redfish/v1/Systems/system
```
For testing, each policy should be set and verified. The one_time aspect should
also be checked for each possible value and verified to only be used once.

Validate that when multiple blackouts occur, the firmware continues to try
to power on the system when the policy is `AlwaysOn` or `Restore`.

On supported systems, a pin hole reset should be done with a system that has
a policy set to always power on. The tester should verify the system does not
automatically power on after a pin hole reset, and that it does automatically
power on when a normal reboot of the BMC is done.

A brownout condition should be injected into a system and appropriate paths
should be verified:
- Error log generated
- Host notified (if running and notification possible)
- System quickly powered off
- Power recovery function is not run while a brownout is present
- System automatically powers back on when brownout condition ends (assuming a
  one-time or system auto power-on recovery policy of `AlwaysOn` or `Restore`)

[pdi-restore]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Control/Power/RestorePolicy.interface.yaml
[state-mgr]: https://github.com/openbmc/phosphor-state-manager
[bmc-state]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/State/BMC.interface.yaml
[dev-tree]: https://github.com/openbmc/docs/blob/master/designs/device-tree-gpio-naming.md