# BMC Service Failure Debug and Recovery

Author: Andrew Jeffery <andrew@aj.id.au> @arj

Other contributors: Andrew Geissler <geissonator@yahoo.com> @geissonator

Created: 6th May 2021

## Problem Description

The capability to debug critical failures of the BMC firmware is essential to
meet the reliability and serviceability claims made for some platforms.

This design addresses two classes of failure:

- A class of failure exists where a BMC service has entered a failed state but
  the BMC is still operational in a degraded mode.
- A class of failure exists under which we can attempt debug data collection
  despite being unable to communicate with the BMC via standard protocols.

This proposal argues for software-driven debug data capture and recovery of a
failed BMC.

## Background and References

By necessity, BMCs are not self-contained systems. BMCs exist to service the
needs of both the host system, by providing in-band platform services such as
thermal and power management, and system operators, by providing out-of-band
system management interfaces such as error reporting, platform telemetry and
firmware management.

As such, failures of BMC subsystems may impact external consumers.

The BMC firmware stack is not trivial, in the sense that common implementations
are usually domain-specific Linux distributions with complex or highly coupled
relationships to platform subsystems.

Complexity and coupling drive concern around the risk of critical failures in
the BMC firmware. The BMC firmware design should provide for resilience and
recovery in the face of well-defined error conditions, but the need to mitigate
ill-defined error conditions or entering unintended software states remains.

The ability for a system to recover in the face of an error condition depends on
its ability to detect the failure. Thus, error conditions can be assigned to
various classes based on the ability to externally observe the error:

1. Continued operation: The service detects the error and performs the actions
   required to return to its operating state

2. Graceful exit: The service detects an error it cannot recover from, but
   gracefully cleans up its resources before exiting with an appropriate exit
   status

3. Crash: The service detects it is in an unintended software state and exits
   immediately, failing to gracefully clean up its resources before exiting

4. Unresponsive: The service fails to detect it cannot make progress and
   continues to run but is unresponsive

As the state transformations to enter the ill-defined or unintended software
state are unanticipated, the actions required to gracefully return to an
expected state are also not well defined. The general approaches to recover a
system or service to a known state in the face of entering an unknown state are:

1. Restart the affected service
2. Restart the affected set of services
3. Restart all services

In the face of continued operation due to internal recovery a service restart is
unnecessary, while in the case of an unresponsive service the need to restart
cannot be detected by service state alone. Implementation of resiliency by way
of service restarts via a service manager is only possible in the face of a
graceful exit or application crash. Handling of services that have entered an
unresponsive state can only begin upon receiving external input.
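
The relationship between the four error classes and who can initiate recovery
can be summarised in a few lines. The following is only an illustrative sketch:
the names `ErrorClass` and `recovery_trigger` and the returned strings are
assumptions made for this example and do not correspond to any OpenBMC
interface.

```python
from enum import Enum


class ErrorClass(Enum):
    CONTINUED_OPERATION = 1  # service recovered internally
    GRACEFUL_EXIT = 2        # service cleaned up and exited with a status
    CRASH = 3                # service exited without cleaning up
    UNRESPONSIVE = 4         # service still runs but makes no progress


def recovery_trigger(error: ErrorClass) -> str:
    """Return who is able to initiate recovery for the given error class."""
    if error is ErrorClass.CONTINUED_OPERATION:
        return "none"  # the service already returned to its operating state
    if error in (ErrorClass.GRACEFUL_EXIT, ErrorClass.CRASH):
        # The exit is observable, so the service manager's restart policy applies.
        return "service-manager"
    # A hung process looks healthy from its service state alone.
    return "external-input"
```

Only the last case lacks an automatic trigger, which is the gap the rest of
this document concentrates on.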

Like error conditions, services exposed by the BMC can be divided into several
external interface classes:

1. Providers of platform data
2. Providers of platform data transports

Examples of the first are applications that expose various platform sensors or
provide data about the firmware itself. Failure of the first class of
applications usually yields a system that can continue to operate in a reduced
capacity.

Examples of the second are the operating system itself and applications that
implement IPMI, HTTPS (e.g. for Redfish), MCTP and PLDM. This second class also
covers implementation-specific data transports such as D-Bus, which requires a
broker service. Failure of a platform data transport may result in one or all
external interfaces becoming unresponsive and be viewed as a critical failure of
the BMC.

Like error conditions and services, the BMC's external interfaces can be divided
into several classes:

1. Out-of-band interfaces: Remote, operator-driven platform management
2. In-band interfaces: Local, host-firmware-driven platform management

Failures of platform data transports generally leave out-of-band interfaces
unresponsive to the point that the BMC cannot be recovered except via external
means, usually by issuing a (disruptive) AC power cycle. On the other hand, if
the host can detect the BMC is unresponsive on the in-band interface(s), an
appropriate platform design can enable the host to reset the BMC without
disrupting its own operation.

### Analysis of eBMC Error State Management and Mitigation Mechanisms

Assessing OpenBMC userspace with respect to the error classes outlined above,
the system manages and mitigates error conditions as follows:

| Condition           | Mechanism                                           |
| ------------------- | --------------------------------------------------- |
| Continued operation | Application-specific error handling                 |
| Graceful exit       | Application-specific error handling                 |
| Crash               | Signal, unhandled exceptions, `assert()`, `abort()` |
| Unresponsive        | None                                                |

These mechanisms inform systemd (the service manager) of an event, which it
handles according to the restart policy encoded in the unit file for the
service.

OpenBMC has a default behavior for all systemd services. That default is to
allow an OpenBMC systemd service to restart twice every 30 seconds. If a service
restarts more than twice within 30 seconds then that service will be considered
to be in a failed state by systemd and not restarted again until a BMC reboot.
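
The default policy described above can be pictured as a counting window. The
sketch below is a simplified model of that description, not systemd's actual
rate-limiting code; the function name `service_state` and the use of a list of
timestamps are assumptions made for illustration.

```python
def service_state(restart_times, burst=2, interval=30.0):
    """Replay restart timestamps (seconds, ascending) against the limit.

    Returns "failed" as soon as more than `burst` restarts fall within an
    `interval`-second window, otherwise "running".
    """
    window = []
    for now in restart_times:
        # Keep only the restarts still inside the window that ends at `now`.
        window = [t for t in window if now - t < interval]
        window.append(now)
        if len(window) > burst:
            return "failed"  # not restarted again until the BMC reboots
    return "running"
```

For example, restarts at 0, 5 and 10 seconds exceed the limit, while restarts
at 0, 20 and 40 seconds do not, because the first has aged out of the window by
the time the third occurs.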

Assessing the OpenBMC operating system with respect to the error classes, it
manages and mitigates error conditions as follows:

| Condition           | Mechanism                              |
| ------------------- | -------------------------------------- |
| Continued operation | ramoops, ftrace, `printk()`            |
| Graceful exit       | System reboot                          |
| Crash               | kdump or ramoops                       |
| Unresponsive        | `hardlockup_panic`, `softlockup_panic` |

Crash conditions in the Linux kernel trigger panics, which are handled by kdump
(though may be handled by ramoops until kdump support is integrated). Kernel
lockup conditions can be configured to trigger panics, which in turn trigger
either ramoops or kdump.

### Synthesis

In the context of the information above, handling of application lock-up error
conditions is not provided. For applications in the platform-data-provider class
of external interfaces, the system will continue to operate with reduced
functionality. For applications in the platform-data-transport-provider class,
this represents a critical failure of the firmware that must have accompanying
debug data.

## Handling platform-data-transport-provider failures

### Requirements

#### Recovery Mechanisms

The ability for external consumers to control the recovery behaviour of BMC
services is usually coarse; the nuanced handling is left to the BMC
implementation. Where available, the options for external consumers tend to be,
in ascending order of severity:

| Severity | BMC Recovery Mechanism  | Used for                                                              |
| -------- | ----------------------- | --------------------------------------------------------------------- |
| 1        | Graceful reboot request | Normal circumstances or recovery from platform data provider failures |
| 2        | Forceful reboot request | Recovery from unresponsive platform data transport providers          |
| 3        | External hardware reset | Unresponsive operating system                                         |

Of course it's not possible to issue these requests over interfaces that are
unresponsive. A robust platform design should be capable of issuing all three
restart requests over separate interfaces to minimise the impact of any one
interface becoming unresponsive. Further, the more severe the reset type, the
fewer dependencies should be in its execution path.

Given the out-of-band path is often limited to just the network, it's not
feasible for the BMC to provide any of the above in the event of some kind of
network or relevant data transport failure. The considerations here are
therefore limited to recovery of unresponsive in-band interfaces.

The need to escalate above mechanism 1 should come with data that captures why
it was necessary, i.e. dumps for services that failed in the path for 1.
However, by escalating straight to 3, the BMC will necessarily miss out on
capturing a debug dump because there is no opportunity for software to intervene
in the reset. Therefore, mechanism 2 should exist in the system design and its
implementation should capture any appropriate data needed to debug the need to
reboot and the inability to execute on approach 1.
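
The escalation ladder and the point at which debug data is lost can be written
down compactly. This is a minimal sketch for illustration only; the names
`Recovery`, `can_capture_debug_data` and `escalate` are assumptions and are not
part of any existing host or BMC firmware interface.

```python
from enum import IntEnum


class Recovery(IntEnum):
    GRACEFUL_REBOOT = 1  # requested over the normal in-band protocol
    FORCEFUL_REBOOT = 2  # dedicated recovery path: capture data, then reboot
    HARDWARE_RESET = 3   # external reset, no software involvement


def can_capture_debug_data(mechanism: Recovery) -> bool:
    """Software only gets a chance to collect data below a hardware reset."""
    return mechanism < Recovery.HARDWARE_RESET


def escalate(mechanism: Recovery) -> Recovery:
    """Return the next, more severe mechanism, saturating at hardware reset."""
    return Recovery(min(mechanism + 1, Recovery.HARDWARE_RESET))
```

Skipping straight from mechanism 1 to mechanism 3 bypasses the only remaining
step for which `can_capture_debug_data` holds.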

The need to escalate to 3 would indicate that the BMC's own mechanisms for
detecting a kernel lockup have failed. Had they not failed, we would have
ramoops or kdump data to analyse. As data cannot be captured with an escalation
to method 3, the need to invoke it will require its own specialised debug
experience. Given this and the kernel's own lockup detection and data collection
mechanism, support for 2 can be implemented in BMC userspace.

Mechanism 1 is typically initiated by the usual in-band interfaces, either IPMI
or PLDM. In order to avoid these in the implementation of mechanism 2, the host
needs an interface to the BMC that is dedicated to the role of BMC recovery,
with minimal dependencies on the BMC side for initiating the dump collection and
reboot. At its core, all that is needed is the ability to trigger a BMC IRQ,
which could be as simple as monitoring a GPIO.

#### Behavioural Requirements for Recovery Mechanism 2

The system behaviour requirement for the mechanism is:

1. The BMC executes collection of debug data and then reboots once it observes a
   recovery message from the host

It's desirable that:

1. The host has some indication that the recovery process has been activated
2. The host has some indication that a BMC reset has taken place

It's necessary that:

1. The host make use of a timeout to escalate to recovery mechanism 3 as it's
   possible the BMC will be unresponsive to recovery mechanism 2

#### Analysis of BMC Recovery Mechanisms for Power10 Platforms

The implementation of recovery mechanism 1 is already accounted for in the
in-band protocols between the host and the BMC and so is considered resolved for
the purpose of the discussion.

To address recovery mechanism 3, the Power10 platform designs wire up a GPIO
driven by the host to the BMC's EXTRST pin. If the host firmware detects that
the BMC has become unresponsive to its escalating recovery requests, it can
drive the hardware to forcefully reset the BMC.

However, host-side GPIOs are in short supply, and we do not have a dedicated pin
to implement recovery mechanism 2 in the platform designs.

#### Analysis of Implementation Methods on Power10 Platforms

The implementation of recovery mechanism 2 is limited to using existing
interfaces between the host and the BMC. These largely consist of:

1. FSI
2. LPC
3. PCIe

FSI is inappropriate because the host is the peripheral in its relationship with
the BMC. If the BMC has become unresponsive, it is possible it's in a state
where it would not accept FSI traffic (which it needs to drive in the first
place) and we would need a mechanism architected into FSI for the BMC to
recognise it is in a bad state. PCIe and LPC are preferable by comparison as the
BMC is the peripheral in this relationship, with the host driving cycles into it
over either interface. Comparatively, PCIe is more complex than LPC, so an
LPC-based approach is preferred.

The host already makes use of several LPC peripherals exposed from the BMC:

1. Mapped LPC FW cycles
2. iBT for IPMI
3. The VUARTs for system and debug consoles
4. A KCS device for a vendor-defined MCTP LPC binding

The host could take advantage of any of the following LPC peripherals for
implementing recovery mechanism 2:

1. The SuperIO-based iLPC2AHB bridge
2. The LPC mailbox
3. An otherwise unused KCS device

In ASPEED BMC SoCs prior to the AST2600 the LPC mailbox required configuration
via the SuperIO device, which exposes the unrestricted iLPC2AHB backdoor into
the BMC's physical address space. The iLPC2AHB capability could not be mitigated
without disabling SuperIO support entirely, and so the ability to use the
mailbox went with it. This security issue is resolved in the AST2600 design, so
the mailbox could be used in the Power10 platforms, but we have lower-complexity
alternatives for generating an IRQ on the BMC. We could use the iLPC2AHB from
the host to drive one of the watchdogs in the BMC to trigger a reset, but this
exposes a stability risk due to the unrestricted power of the interface, let
alone the security implications, and like the mailbox is more complex than the
alternatives.

This draws us towards the use of a KCS device, which is best aligned with the
simple need of generating an IRQ on the BMC. AST2600 has at least 4 KCS devices,
of which one is already in use for IBM's vendor-defined MCTP LPC binding,
leaving at least 3 from which to choose.

### Proposed Design

The proposed design is for a simple daemon started at BMC boot to invoke the
desired crash dump handler according to the system policy upon receiving the
external signal. The implementation should have no IPC dependencies or
interactions with `init`, as the reason for invoking the recovery mechanism is
unknown and any of these interfaces might be unresponsive.

A trivial implementation of the daemon is:

```sh
# Block until one byte arrives on the recovery interface at $path
dd if="$path" bs=1 count=1
# Crash the kernel deliberately via the SysRq 'c' trigger
echo c > /proc/sysrq-trigger
```

For systems with kdump enabled, this will result in a kernel crash dump
collection and the BMC being rebooted.

A more elegant implementation might be to invoke `kexec` directly, but this
requires that `kexec` support is already available on the platform.

Other activities in userspace might be feasible if it can be assumed that
whatever failure has occurred will not prevent debug data collection, but no
statement about this can be made in general.

#### An Idealised KCS-based Protocol for Power10 Platforms

The proposed implementation provides for both the required and desired
behaviours outlined in the requirements section above.

The host and BMC protocol operates as follows, starting with the BMC application
invoked during the boot process:

1. Set the `Ready` bit in STR

2. Wait for an `IBF` interrupt

3. Read `IDR`. The hardware clears IBF as a result

4. If the read value is 0x44 (`D` for "Debug") then execute the debug dump
   collection process and reboot. Otherwise,

5. Go to step 2.
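
The BMC-side steps above can be sketched against an in-memory stand-in for the
KCS device. This is a simulation under stated assumptions rather than the
prototype's code: `FakeKcs`, `bmc_recovery_loop` and the callback argument are
hypothetical names, and a real implementation would block on the device
instead of replaying a list of host writes.

```python
DEBUG_BYTE = 0x44  # 'D' for "Debug"
STR_READY = 1 << 4
STR_IBF = 1 << 1


class FakeKcs:
    """In-memory stand-in for the BMC's view of a KCS device (hypothetical)."""

    def __init__(self, host_writes):
        self.status = 0  # models STR
        self.idr = 0
        self.pending = list(host_writes)

    def wait_for_ibf(self):
        # Stand-in for blocking on the IBF interrupt: deliver the next host write.
        if not self.pending:
            raise EOFError("no more host writes in this simulation")
        self.idr = self.pending.pop(0)
        self.status |= STR_IBF

    def read_idr(self):
        self.status &= ~STR_IBF  # the hardware clears IBF when IDR is read
        return self.idr


def bmc_recovery_loop(kcs, collect_dump_and_reboot):
    kcs.status |= STR_READY          # step 1
    while True:
        kcs.wait_for_ibf()           # step 2
        value = kcs.read_idr()       # step 3
        if value == DEBUG_BYTE:      # step 4
            collect_dump_and_reboot()
            return
        # step 5: any other value is ignored and the loop waits again
```

Passing the dump-and-reboot action in as a callback keeps the loop free of any
policy, so the same skeleton could invoke whichever crash dump handler the
system selects.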

On the host:

1. If the `Ready` bit in STR is clear, escalate to recovery mechanism 3.
   Otherwise,

2. If the `IBF` bit in STR is set, escalate to recovery mechanism 3. Otherwise,

3. Start an escalation timer

4. Write 0x44 (`D` for "Debug") to the Input Data Register (IDR). The hardware
   sets IBF as a result

5. If `IBF` clears before expiry, restart the escalation timer

6. If an STR read generates an LPC SYNC No Response abort, or `Ready` clears
   before expiry, restart the escalation timer

7. If `Ready` becomes set before expiry, disarm the escalation timer. Recovery
   is complete. Otherwise,

8. Escalate to recovery mechanism 3 if the escalation timer expires at any point
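
The host-side steps can be sketched as a polling loop. The following is an
illustrative model only: `host_recover`, `NoResponse` and the callable
arguments are hypothetical, the escalation timer is counted in polls rather
than wall-clock time, and treating a No Response abort on the very first STR
read as grounds for escalation is an assumption the steps above do not state.

```python
DEBUG_BYTE = 0x44  # 'D' for "Debug"
STR_READY = 1 << 4
STR_IBF = 1 << 1


class NoResponse(Exception):
    """Models an LPC SYNC No Response abort on an STR read."""


def host_recover(read_str, write_idr, timeout_polls):
    """Return "recovered", or "escalate" when mechanism 3 is required."""
    try:
        status = read_str()
    except NoResponse:
        return "escalate"  # assumption: not covered by the documented steps
    if not status & STR_READY:
        return "escalate"                                     # step 1
    if status & STR_IBF:
        return "escalate"                                     # step 2
    remaining = timeout_polls                                 # step 3
    write_idr(DEBUG_BYTE)                                     # step 4
    phase = "await-read"
    while remaining > 0:
        remaining -= 1
        try:
            status = read_str()
        except NoResponse:
            status = None
        resetting = status is None or not status & STR_READY
        if phase != "await-ready" and resetting:
            phase, remaining = "await-ready", timeout_polls   # step 6
        elif phase == "await-read" and not status & STR_IBF:
            phase, remaining = "await-reset", timeout_polls   # step 5
        elif phase == "await-ready" and not resetting:
            return "recovered"                                # step 7
    return "escalate"                                         # step 8
```

The explicit phases exist because `Ready` is set both before the reset and
after it; the host has to observe it clearing, or a No Response abort, before a
later `Ready` can be taken as evidence that the BMC has rebooted.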

A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
implementation is not required to emit one and the host implementation must
behave correctly without one. Recovery is only necessary if other paths have
failed, so STR can be read by the host when it decides recovery is required, and
read by time-based polling thereafter.

The host must be prepared to handle LPC SYNC errors when accessing the KCS
device IO addresses, particularly "No Response" aborts. It is not guaranteed
that the KCS device will remain available during BMC resets.

As STR is polled by the host it's not necessary for the BMC to write to ODR. The
protocol only requires the host to write to IDR and periodically poll STR for
changes to IBF and Ready state. This removes bi-directional dependencies.

The uni-directional writes and the lack of SerIRQ reduce the features required
for correct operation of the protocol and thus the surface area for failure of
the recovery protocol.

The layout of the KCS Status Register (STR) is as follows:

| Bit | Owner    | Definition               |
| --- | -------- | ------------------------ |
| 7   | Software |                          |
| 6   | Software |                          |
| 5   | Software |                          |
| 4   | Software | Ready                    |
| 3   | Hardware | Command / Data           |
| 2   | Software |                          |
| 1   | Hardware | Input Buffer Full (IBF)  |
| 0   | Hardware | Output Buffer Full (OBF) |
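
The bit positions in the table translate directly into masks. The constant
names and the `decode_str` helper below are illustrative assumptions; only the
bit positions come from the table.

```python
STR_OBF = 1 << 0    # Output Buffer Full, hardware-owned
STR_IBF = 1 << 1    # Input Buffer Full, hardware-owned
STR_CD = 1 << 3     # Command / Data, hardware-owned
STR_READY = 1 << 4  # Ready, software-owned


def decode_str(value):
    """Return the names of the defined STR bits that are set in `value`."""
    names = {STR_READY: "Ready", STR_CD: "C/D", STR_IBF: "IBF", STR_OBF: "OBF"}
    return [name for mask, name in names.items() if value & mask]
```

For instance, an STR value of 0x12 decodes to `Ready` and `IBF`: the BMC has
announced itself and a byte written by the host has not yet been read.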

#### A Real-World Implementation of the KCS Protocol for Power10 Platforms

Implementing the protocol described above in userspace is challenging due to
the available kernel interfaces[1], and implementing the behaviour in the kernel
falls afoul of the de facto "mechanism, not policy" rule of kernel support.

Realistically, on the host side the only requirements are the use of a timer and
writing the appropriate value to the Input Data Register (IDR). All the proposed
status bits can be ignored. With this in mind, the BMC's implementation can be
reduced to reading an appropriate value from IDR. Reducing requirements on the
BMC's behaviour in this way allows the use of the `serio_raw` driver (which has
the restriction that userspace can't access the status value).

[1]
https://lore.kernel.org/lkml/37e75b07-a5c6-422f-84b3-54f2bea0b917@www.fastmail.com/

#### Prototype Implementation Supporting Power10 Platforms

A concrete implementation of the proposal's userspace daemon is available on
GitHub:

https://github.com/amboar/debug-trigger/

Deployment requires additional kernel support in the form of patches at [2].

[2]
https://github.com/amboar/linux/compare/2dbb5aeba6e55e2a97e150f8371ffc1cc4d18180...for/openbmc/kcs-raw

### Alternatives Considered

See the discussion in Background.

### Impacts

The proposal has some security implications. The mechanism provides an
unauthenticated means for the host firmware to crash and/or reboot the BMC,
which can itself become a concern for stability and availability. Use of this
feature requires that the host firmware is trusted, that is, that the host and
BMC firmware must be in the same trust domain. If a platform concept requires
that the BMC and host firmware remain in disjoint trust domains then this
feature must not be provided by the BMC.

As the feature might provide surprising system behaviour, there is an impact on
documentation for systems deploying this design: The mechanism must be
documented in such a way that rebooting the BMC in these circumstances isn't
surprising.

Developers are impacted in the sense that they may have access to better debug
data than might otherwise be possible. There are no obvious developer-specific
drawbacks.

Due to simplicity being a design-point of the proposal, there are no significant
API, performance or upgradability impacts.

### Testing

Generally, testing this feature requires complex interactions with host firmware
and platform-specific mechanisms for triggering the reboot behaviour.

For Power10 platforms this feature may be safely tested under QEMU by scripting
the monitor to inject values on the appropriate KCS device. Implementing this
for automated testing may need explicit support in CI.

## Handling platform-data-provider failures

### Requirements

As noted above, these types of failures usually yield a system that can continue
to operate in a reduced capacity. The desired behavior in this scenario can vary
from system to system, so the requirements in this area need to be flexible
enough to allow system owners to configure their desired behavior.

The requirements for OpenBMC when a platform-data-provider service enters a
failure state are that the BMC:

- Logs an error indicating a service has failed
- Collects a BMC dump
- Changes BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
- Allows system owners to customize other behaviors (e.g. BMC reboot)

### Proposed Design

This will build upon the existing [target-fail-monitoring][3] design. The
monitor service will be enhanced to also take JSON file(s) which list critical
services to monitor.

Define an "obmc-bmc-service-quiesce.target". System owners can install any other
services they wish in this new target.

phosphor-bmc-state-manager will monitor this target and enter a `Quiesced` state
when it is started. This state will be reported externally via the Redfish API
under the status property of redfish/v1/Managers/bmc.

This would look like the following:

- In a services-to-monitor configuration file, add all critical services
- The state-manager service-monitor will subscribe to signals for service
  failures and do the following when a service listed in the configuration
  file fails:
  - Log error with service failure information
  - Request a BMC dump
  - Start obmc-bmc-service-quiesce.target
- BMC state manager detects obmc-bmc-service-quiesce.target has started and puts
  the BMC state into Quiesced
- bmcweb looks at BMC state to return appropriate state to external clients

[3]:
  https://github.com/openbmc/docs/blob/master/designs/target-fail-monitoring.md
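
The monitoring flow in the proposed design can be sketched as a pure function
from a failed unit to the actions taken. This is a minimal sketch under stated
assumptions: the JSON key `services`, the example unit names, the action labels
and the function names are all hypothetical and are not the monitor's actual
schema or API. Only the target name comes from the design above.

```python
import json

# Hypothetical configuration: the key name "services" is an assumption.
CONFIG = '{"services": ["example-sensors.service", "example-inventory.service"]}'

QUIESCE_TARGET = "obmc-bmc-service-quiesce.target"


def load_monitored(config_text):
    """Parse the list of critical services from a configuration document."""
    return set(json.loads(config_text)["services"])


def on_unit_failed(unit, monitored):
    """Return the ordered actions the monitor would take for a failed unit."""
    if unit not in monitored:
        return []  # not listed as critical: nothing to do
    return [
        ("log-error", unit),
        ("request-bmc-dump", unit),
        ("start-unit", QUIESCE_TARGET),
    ]
```

Starting the target last means the error log and dump request are already in
flight by the time the BMC state manager observes the target and reports the
`Quiesced` state.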

### Alternatives Considered

One simpler option would be to just have the OnFailure result in a BMC reboot
but historically this has caused more problems than it solves:

- Rarely does a BMC reboot fix a service that was not fixed by simply restarting
  it.
- A BMC that continuously reboots itself due to a service failure is very
  difficult to debug.
- Some BMCs only allow a certain number of reboots, so eventually the BMC ends
  up stuck in the boot loader, which is inaccessible unless special debug cables
  are available. For all intents and purposes the system is then unusable.

### Impacts

Currently nothing happens when a service enters the fail state. The changes
proposed in this document will ensure an error is logged, a dump is collected,
and the external BMC state reflects the failure when this occurs.

### Testing

A variety of services should be put into the fail state and the tester should
ensure the appropriate error is logged, dump is collected, and BMC state is
changed to reflect this.