# BMC Service Failure Debug and Recovery

Author:
Andrew Jeffery <andrew@aj.id.au> @arj

Other contributors:
Andrew Geissler <geissonator@yahoo.com> @geissonator

Created:
6th May 2021

## Problem Description

The capability to debug critical failures of the BMC firmware is essential to
meet the reliability and serviceability claims made for some platforms.

This design addresses a few classes of failures:

- A class of failure exists where a BMC service has entered a failed state but
  the BMC is still operational in a degraded mode.
- A class of failure exists under which we can attempt debug data collection
  despite being unable to communicate with the BMC via standard protocols.

This proposal argues for, and describes, software-driven debug data capture
and recovery of a failed BMC.

## Background and References

By necessity, BMCs are not self-contained systems. BMCs exist to service the
needs of both the host system, by providing in-band platform services such as
thermal and power management, and system operators, by providing out-of-band
system management interfaces such as error reporting, platform telemetry and
firmware management.

As such, failures of BMC subsystems may impact external consumers.

The BMC firmware stack is not trivial, in the sense that common implementations
are usually domain-specific Linux distributions with complex or highly coupled
relationships to platform subsystems.

Complexity and coupling drive concern around the risk of critical failures in
the BMC firmware. The BMC firmware design should provide for resilience and
recovery in the face of well-defined error conditions, but the need to mitigate
ill-defined error conditions or unintended software states remains.

The ability for a system to recover in the face of an error condition depends
on its ability to detect the failure. Thus, error conditions can be assigned to
various classes based on the ability to externally observe the error:

1. Continued operation: The service detects the error and performs the actions
   required to return to its operating state

2. Graceful exit: The service detects an error it cannot recover from, but
   gracefully cleans up its resources before exiting with an appropriate exit
   status

3. Crash: The service detects it is in an unintended software state and exits
   immediately, failing to gracefully clean up its resources before exiting

4. Unresponsive: The service fails to detect it cannot make progress and
   continues to run but is unresponsive

As the state transformations to enter the ill-defined or unintended software
state are unanticipated, the actions required to gracefully return to an
expected state are also not well defined. The general approaches to recover a
system or service to a known state in the face of entering an unknown state
are:

1. Restart the affected service
2. Restart the affected set of services
3. Restart all services

In the face of continued operation due to internal recovery, a service restart
is unnecessary, while in the case of an unresponsive service the need to
restart cannot be detected by service state alone. Implementation of resiliency
by way of service restarts via a service manager is only possible in the face
of a graceful exit or application crash. Handling of services that have entered
an unresponsive state can only begin upon receiving external input.

Like error conditions, services exposed by the BMC can be divided into several
external interface classes:

1. Providers of platform data
2. Providers of platform data transports

Examples of the first are applications that expose various platform sensors or
provide data about the firmware itself. Failure of the first class of
applications usually yields a system that can continue to operate in a reduced
capacity.

Examples of the second are the operating system itself and applications that
implement IPMI, HTTPS (e.g. for Redfish), MCTP and PLDM. This second class also
covers implementation-specific data transports such as D-Bus, which requires a
broker service. Failure of a platform data transport may result in one or all
external interfaces becoming unresponsive, and is viewed as a critical failure
of the BMC.

Like error conditions and services, the BMC's external interfaces can be
divided into several classes:

1. Out-of-band interfaces: Remote, operator-driven platform management
2. In-band interfaces: Local, host-firmware-driven platform management

Failures of platform data transports generally leave out-of-band interfaces
unresponsive to the point that the BMC cannot be recovered except via external
means, usually by issuing a (disruptive) AC power cycle. On the other hand, if
the host can detect the BMC is unresponsive on the in-band interface(s), an
appropriate platform design can enable the host to reset the BMC without
disrupting its own operation.

### Analysis of eBMC Error State Management and Mitigation Mechanisms

Assessing OpenBMC userspace with respect to the error classes outlined above,
the system manages and mitigates error conditions as follows:

| Condition           | Mechanism                                           |
|---------------------|-----------------------------------------------------|
| Continued operation | Application-specific error handling                 |
| Graceful exit       | Application-specific error handling                 |
| Crash               | Signal, unhandled exceptions, `assert()`, `abort()` |
| Unresponsive        | None                                                |

These mechanisms inform systemd (the service manager) of an event, which it
handles according to the restart policy encoded in the unit file for the
service.

OpenBMC has a default behavior for all systemd services. That default is to
allow an OpenBMC systemd service to restart twice every 30 seconds. If a service
restarts more than twice within 30 seconds then it is considered by systemd to
be in a failed state and is not restarted again until the BMC reboots.
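
This default can be expressed per-unit with systemd's start-rate-limiting
settings. The drop-in below is a hypothetical sketch (the unit name is only an
example); the values simply mirror the default described above.

```sh
# Hypothetical drop-in capturing the described restart policy for one service.
mkdir -p /etc/systemd/system/example-phosphor-app.service.d
cat <<'EOF' > /etc/systemd/system/example-phosphor-app.service.d/restart.conf
[Unit]
# More than 2 starts within 30 seconds puts the unit into the failed state
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
Restart=on-failure
EOF
systemctl daemon-reload
```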

Assessing the OpenBMC operating system with respect to the error classes, it
manages and mitigates error conditions as follows:

| Condition           | Mechanism                              |
|---------------------|----------------------------------------|
| Continued operation | ramoops, ftrace, `printk()`            |
| Graceful exit       | System reboot                          |
| Crash               | kdump or ramoops                       |
| Unresponsive        | `hardlockup_panic`, `softlockup_panic` |

Crash conditions in the Linux kernel trigger panics, which are handled by kdump
(though they may be handled by ramoops until kdump support is integrated).
Kernel lockup conditions can be configured to trigger panics, which in turn
trigger either ramoops or kdump.
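
As an illustration, the lockup-to-panic behaviour can be armed at runtime with
the standard kernel sysctls. The values below are assumptions for the sketch,
not a statement of OpenBMC defaults.

```sh
# Illustrative sysctl settings; values are assumptions, not OpenBMC defaults.
sysctl -w kernel.softlockup_panic=1  # panic when a soft lockup is detected
sysctl -w kernel.hardlockup_panic=1  # panic when a hard lockup is detected
sysctl -w kernel.panic=10            # reboot 10 seconds after a panic
```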

### Synthesis

In the context of the information above, handling of application lock-up error
conditions is not provided. For applications in the platform-data-provider
class of external interfaces, the system will continue to operate with reduced
functionality. For applications in the platform-data-transport-provider class,
this represents a critical failure of the firmware that must have accompanying
debug data.

## Handling platform-data-transport-provider failures

### Requirements

#### Recovery Mechanisms

The ability for external consumers to control the recovery behaviour of BMC
services is usually coarse; the nuanced handling is left to the BMC
implementation. Where available, the options for external consumers tend to be,
in ascending order of severity:

| Severity | BMC Recovery Mechanism  | Used for                                                               |
|----------|-------------------------|------------------------------------------------------------------------|
| 1        | Graceful reboot request | Normal circumstances or recovery from platform data provider failures  |
| 2        | Forceful reboot request | Recovery from unresponsive platform data transport providers           |
| 3        | External hardware reset | Unresponsive operating system                                          |

Of course it's not possible to issue these requests over interfaces that are
unresponsive. A robust platform design should be capable of issuing all three
restart requests over separate interfaces to minimise the impact of any one
interface becoming unresponsive. Further, the more severe the reset type, the
fewer dependencies should be in its execution path.

Given the out-of-band path is often limited to just the network, it's not
feasible for the BMC to provide any of the above in the event of some kind of
network or relevant data transport failure. The considerations here are
therefore limited to recovery of unresponsive in-band interfaces.

The need to escalate above mechanism 1 should come with data that captures why
it was necessary, i.e. dumps for services that failed in the path for 1.
However, by escalating straight to 3, the BMC will necessarily miss out on
capturing a debug dump because there is no opportunity for software to
intervene in the reset. Therefore, mechanism 2 should exist in the system
design and its implementation should capture any appropriate data needed to
debug the need to reboot and the inability to execute on approach 1.

The need to escalate to 3 would indicate that the BMC's own mechanisms for
detecting a kernel lockup have failed. Had they not failed, we would have
ramoops or kdump data to analyse. As data cannot be captured with an escalation
to mechanism 3, invoking it will require its own specialised debug experience.
Given this and the kernel's own lockup detection and data collection
mechanisms, support for 2 can be implemented in BMC userspace.

Mechanism 1 is typically initiated by the usual in-band interfaces, either IPMI
or PLDM. In order to avoid these in the implementation of mechanism 2, the host
needs an interface to the BMC that is dedicated to the role of BMC recovery,
with minimal dependencies on the BMC side for initiating the dump collection
and reboot. At its core, all that is needed is the ability to trigger a BMC
IRQ, which could be as simple as monitoring a GPIO.
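
As an aside, on a platform that did have a spare pin for this purpose, the
BMC-side monitoring could be as small as the following sketch (the libgpiod v1
tools are assumed, and the chip and line offset are placeholders):

```sh
# Block until the host asserts the (hypothetical) recovery-request GPIO,
# then hand off to the platform's dump-and-reboot policy.
gpiomon --num-events=1 --rising-edge gpiochip0 12
# ... invoke the crash dump handler and reboot here
```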

#### Behavioural Requirements for Recovery Mechanism 2

The system behaviour requirement for the mechanism is:

1. The BMC executes collection of debug data and then reboots once it observes
   a recovery message from the host

It's desirable that:

1. The host has some indication that the recovery process has been activated
2. The host has some indication that a BMC reset has taken place

It's necessary that:

1. The host make use of a timeout to escalate to recovery mechanism 3 as it's
   possible the BMC will be unresponsive to recovery mechanism 2

#### Analysis of BMC Recovery Mechanisms for Power10 Platforms

The implementation of recovery mechanism 1 is already accounted for in the
in-band protocols between the host and the BMC and so is considered resolved
for the purpose of the discussion.

To address recovery mechanism 3, the Power10 platform designs wire up a GPIO
driven by the host to the BMC's EXTRST pin. If the host firmware detects that
the BMC has become unresponsive to its escalating recovery requests, it can
drive the hardware to forcefully reset the BMC.

However, host-side GPIOs are in short supply, and we do not have a dedicated
pin to implement recovery mechanism 2 in the platform designs.

#### Analysis of Implementation Methods on Power10 Platforms

The implementation of recovery mechanism 2 is limited to using existing
interfaces between the host and the BMC. These largely consist of:

1. FSI
2. LPC
3. PCIe

FSI is inappropriate because the host is the peripheral in its relationship
with the BMC. If the BMC has become unresponsive, it is possible it's in a
state where it would not accept FSI traffic (which it needs to drive in the
first place) and we would need a mechanism architected into FSI for the BMC to
recognise it is in a bad state. PCIe and LPC are preferable by comparison as
the BMC is the peripheral in this relationship, with the host driving cycles
into it over either interface. Comparatively, PCIe is more complex than LPC, so
an LPC-based approach is preferred.

The host already makes use of several LPC peripherals exposed from the BMC:

1. Mapped LPC FW cycles
2. iBT for IPMI
3. The VUARTs for system and debug consoles
4. A KCS device for a vendor-defined MCTP LPC binding

The host could take advantage of any of the following LPC peripherals for
implementing recovery mechanism 2:

1. The SuperIO-based iLPC2AHB bridge
2. The LPC mailbox
3. An otherwise unused KCS device

In ASPEED BMC SoCs prior to the AST2600, the LPC mailbox required configuration
via the SuperIO device, which exposes the unrestricted iLPC2AHB backdoor into
the BMC's physical address space. The iLPC2AHB capability could not be
mitigated without disabling SuperIO support entirely, and so the ability to use
the mailbox went with it. This security issue is resolved in the AST2600
design, so the mailbox could be used in the Power10 platforms, but we have
lower-complexity alternatives for generating an IRQ on the BMC. We could use
the iLPC2AHB from the host to drive one of the watchdogs in the BMC to trigger
a reset, but this exposes a stability risk due to the unrestricted power of the
interface, let alone the security implications, and like the mailbox it is more
complex than the alternatives.

This draws us towards the use of a KCS device, which is best aligned with the
simple need of generating an IRQ on the BMC. The AST2600 has at least 4 KCS
devices, of which one is already in use for IBM's vendor-defined MCTP LPC
binding, leaving at least 3 from which to choose.

### Proposed Design

The proposed design is for a simple daemon started at BMC boot to invoke the
desired crash dump handler according to the system policy upon receiving the
external signal. The implementation should have no IPC dependencies or
interactions with `init`, as the reason for invoking the recovery mechanism is
unknown and any of these interfaces might be unresponsive.

A trivial implementation of the daemon is:

```sh
# Block until the host writes a byte to the trigger interface at $path
dd if=$path bs=1 count=1
# Crash the kernel so that a dump is collected and the BMC reboots
echo c > /proc/sysrq-trigger
```

For systems with kdump enabled, this will result in a kernel crash dump being
collected and the BMC being rebooted.

A more elegant implementation might be to invoke `kexec` directly, but this
requires that the support is already available on the platform.
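
One reading of that suggestion, sketched below under the assumption that
kexec-tools and a suitable capture kernel are available, is to pre-load a crash
kernel so that the panic path above lands in a capture environment. The paths,
device tree and command line are placeholders, not values from this proposal.

```sh
# Hypothetical sketch: pre-load a capture kernel for the panic path.
kexec -p /boot/capture-kernel \
      --dtb=/boot/capture.dtb \
      --append="console=ttyS4,115200 maxcpus=1 reset_devices"
```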

Other activities in userspace might be feasible if it can be assumed that
whatever failure has occurred will not prevent debug data collection, but no
statement about this can be made in general.

#### An Idealised KCS-based Protocol for Power10 Platforms

The proposed implementation provides for both the required and desired
behaviours outlined in the requirements section above.

The host and BMC protocol operates as follows, starting with the BMC
application invoked during the boot process:

1. Set the `Ready` bit in STR

2. Wait for an `IBF` interrupt

3. Read `IDR`. The hardware clears IBF as a result

4. If the read value is 0x44 (`D` for "Debug") then execute the debug dump
   collection process and reboot. Otherwise,

5. Go to step 2.

On the host:

1. If the `Ready` bit in STR is clear, escalate to recovery mechanism 3.
   Otherwise,

2. If the `IBF` bit in STR is set, escalate to recovery mechanism 3. Otherwise,

3. Start an escalation timer

4. Write 0x44 (`D` for "Debug") to the Input Data Register (IDR). The hardware
   sets IBF as a result

5. If `IBF` clears before expiry, restart the escalation timer

6. If an STR read generates an LPC SYNC No Response abort, or `Ready` clears
   before expiry, restart the escalation timer

7. If `Ready` becomes set before expiry, disarm the escalation timer. Recovery
   is complete. Otherwise,

8. Escalate to recovery mechanism 3 if the escalation timer expires at any
   point

A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
implementation is not required to emit one and the host implementation must
behave correctly without one. Recovery is only necessary if other paths have
failed, so STR can be read by the host when it decides recovery is required,
and read by time-based polling thereafter.

The host must be prepared to handle LPC SYNC errors when accessing the KCS
device IO addresses, particularly "No Response" aborts. It is not guaranteed
that the KCS device will remain available during BMC resets.

As STR is polled by the host it's not necessary for the BMC to write to ODR.
The protocol only requires the host to write to IDR and periodically poll STR
for changes to IBF and Ready state. This removes bi-directional dependencies.

The uni-directional writes and the lack of SerIRQ reduce the features required
for correct operation of the protocol and thus the surface area for failure of
the recovery protocol.

The layout of the KCS Status Register (STR) is as follows:

| Bit | Owner    | Definition               |
|-----|----------|--------------------------|
| 7   | Software |                          |
| 6   | Software |                          |
| 5   | Software |                          |
| 4   | Software | Ready                    |
| 3   | Hardware | Command / Data           |
| 2   | Software |                          |
| 1   | Hardware | Input Buffer Full (IBF)  |
| 0   | Hardware | Output Buffer Full (OBF) |

#### A Real-World Implementation of the KCS Protocol for Power10 Platforms

Implementing the protocol described above in userspace is challenging due to
the available kernel interfaces[1], and implementing the behaviour in the
kernel falls afoul of the de facto "mechanism, not policy" rule of kernel
support.

Realistically, on the host side the only requirements are the use of a timer
and writing the appropriate value to the Input Data Register (IDR). All the
proposed status bits can be ignored. With this in mind, the BMC's
implementation can be reduced to reading an appropriate value from IDR.
Reducing requirements on the BMC's behaviour in this way allows the use of the
`serio_raw` driver (which has the restriction that userspace can't access the
status value).
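
A minimal sketch of this reduced behaviour follows, assuming the KCS channel is
exposed through `serio_raw` as a hypothetical `/dev/serio_raw0` node:

```sh
# Sketch only: read bytes from the raw KCS data register until the debug
# command byte (0x44, 'D') arrives, then trigger the crash path.
while true; do
    byte=$(dd if=/dev/serio_raw0 bs=1 count=1 2>/dev/null \
        | od -An -tx1 | tr -d ' ')
    [ "$byte" = "44" ] && echo c > /proc/sysrq-trigger
done
```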

[1] https://lore.kernel.org/lkml/37e75b07-a5c6-422f-84b3-54f2bea0b917@www.fastmail.com/

#### Prototype Implementation Supporting Power10 Platforms

A concrete implementation of the proposal's userspace daemon is available on
GitHub:

https://github.com/amboar/debug-trigger/

Deployment requires additional kernel support in the form of patches at [2].

[2] https://github.com/amboar/linux/compare/2dbb5aeba6e55e2a97e150f8371ffc1cc4d18180...for/openbmc/kcs-raw

### Alternatives Considered

See the discussion in Background.

### Impacts

The proposal has some security implications. The mechanism provides an
unauthenticated means for the host firmware to crash and/or reboot the BMC,
which can itself become a concern for stability and availability. Use of this
feature requires that the host firmware is trusted, that is, that the host and
BMC firmware must be in the same trust domain. If a platform concept requires
that the BMC and host firmware remain in disjoint trust domains then this
feature must not be provided by the BMC.

As the feature might provide surprising system behaviour, there is an impact on
documentation for systems deploying this design: the mechanism must be
documented in such a way that rebooting the BMC in these circumstances isn't
surprising.

Developers are impacted in the sense that they may have access to better debug
data than might otherwise be possible. There are no obvious developer-specific
drawbacks.

Due to simplicity being a design point of the proposal, there are no
significant API, performance or upgradability impacts.

### Testing

Generally, testing this feature requires complex interactions with host
firmware and platform-specific mechanisms for triggering the reboot behaviour.

For Power10 platforms this feature may be safely tested under QEMU by scripting
the monitor to inject values on the appropriate KCS device. Implementing this
for automated testing may need explicit support in CI.

## Handling platform-data-provider failures

### Requirements

As noted above, these types of failures usually yield a system that can
continue to operate in a reduced capacity. The desired behavior in this
scenario can vary from system to system, so the requirements in this area need
to be flexible enough to allow system owners to configure their desired
behavior.

The requirements for OpenBMC when a platform-data-provider service enters a
failure state are that the BMC:

- Logs an error indicating a service has failed
- Collects a BMC dump
- Changes BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
- Allows system owners to customize other behaviors (e.g. a BMC reboot)

### Proposed Design

This will build upon the existing [target-fail-monitoring][1] design. The
monitor service will be enhanced to also take JSON file(s) which list critical
services to monitor.

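A hypothetical sketch of such a configuration file is shown below; the schema
and the service names are illustrative only, not a fixed format:

```json
{
    "services": [
        "xyz.openbmc_project.EntityManager.service",
        "phosphor-pid-control.service"
    ]
}
```
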
Define an "obmc-bmc-service-quiesce.target". System owners can install any
other services they wish in this new target.

phosphor-bmc-state-manager will monitor this target and enter a `Quiesced`
state when it is started. This state will be reported externally via the
Redfish API under the redfish/v1/Managers/bmc status property.
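
For illustration, the reported state can also be inspected on the BMC itself
through the state manager's D-Bus object; the sketch below assumes the standard
phosphor-state-manager bus and object names.

```sh
# Query the BMC state reported by phosphor-bmc-state-manager over D-Bus.
busctl get-property xyz.openbmc_project.State.BMC \
    /xyz/openbmc_project/state/bmc0 \
    xyz.openbmc_project.State.BMC CurrentBMCState
```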

This would look like the following:

- In a services-to-monitor configuration file, add all critical services
- The state-manager service-monitor will subscribe to signals for service
  failures and do the following when a service listed in the configuration file
  fails:
  - Log an error with the service failure information
  - Request a BMC dump
  - Start obmc-bmc-service-quiesce.target
- BMC state manager detects obmc-bmc-service-quiesce.target has started and puts
  the BMC state into Quiesced
- bmcweb looks at the BMC state to return the appropriate state to external
  clients

[1]: https://github.com/openbmc/docs/blob/master/designs/target-fail-monitoring.md

### Alternatives Considered

One simpler option would be to just have the OnFailure result in a BMC reboot,
but historically this has caused more problems than it solves:

- Rarely does a BMC reboot fix a service that was not fixed by simply restarting
  it.
- A BMC that continuously reboots itself due to a service failure is very
  difficult to debug.
- Some BMCs only allow a certain number of reboots, so eventually the BMC ends
  up stuck in the boot loader, which is inaccessible unless special debug cables
  are available; for all intents and purposes the system is then unusable.

### Impacts

Currently nothing happens when a service enters the fail state. The changes
proposed in this document will ensure an error is logged, a dump is collected,
and the external BMC state reflects the failure when this occurs.

### Testing

A variety of services should be put into the fail state, and the tester should
ensure the appropriate error is logged, a dump is collected, and the BMC state
is changed to reflect this.
