# BMC Service Failure Debug and Recovery

Author:
Andrew Jeffery <andrew@aj.id.au> @arj

Primary Assignee:
Andrew Jeffery <andrew@aj.id.au> @arj

Created:
6th May 2021

## Problem Description

The capability to debug critical failures of the BMC firmware is essential to
meet the reliability and serviceability claims made for some platforms.

A class of failure exists under which we can attempt debug data collection
despite being unable to communicate with the BMC via standard protocols.

This proposal argues for, and describes, a software-driven mechanism for debug
data capture and recovery of a failed BMC.

## Background and References

By necessity, BMCs are not self-contained systems. BMCs exist to service the
needs of both the host system, by providing in-band platform services such as
thermal and power management, and of system operators, by providing
out-of-band system management interfaces such as error reporting, platform
telemetry and firmware management.

As such, failures of BMC subsystems may impact external consumers.

The BMC firmware stack is not trivial, in the sense that common implementations
are usually domain-specific Linux distributions with complex or highly
coupled relationships to platform subsystems.

Complexity and coupling drive concern around the risk of critical failures in
the BMC firmware. The BMC firmware design should provide for resilience and
recovery in the face of well-defined error conditions, but the need to mitigate
ill-defined error conditions and unintended software states remains.

The ability for a system to recover in the face of an error condition depends
on its ability to detect the failure. Thus, error conditions can be assigned to
various classes based on the ability to externally observe the error:

1. Continued operation: The service detects the error and performs the actions
   required to return to its operating state

2. Graceful exit: The service detects an error it cannot recover from, but
   gracefully cleans up its resources before exiting with an appropriate exit
   status

3. Crash: The service detects it is in an unintended software state and exits
   immediately, failing to gracefully clean up its resources before exiting

4. Unresponsive: The service fails to detect it cannot make progress and
   continues to run but is unresponsive

As the state transformations to enter the ill-defined or unintended software
state are unanticipated, the actions required to gracefully return to an
expected state are also not well defined. The general approaches to recovering
a system or service to a known state after it has entered an unknown state
are:

1. Restart the affected service
2. Restart the affected set of services
3. Restart all services

In the case of continued operation due to internal recovery, a service restart
is unnecessary, while in the case of an unresponsive service the need to
restart cannot be detected from service state alone. Implementation of
resiliency by way of service restarts via a service manager is only possible
in the face of a graceful exit or application crash. Handling of services that
have entered an unresponsive state can only begin upon receiving external
input.

Like error conditions, services exposed by the BMC can be divided into several
external interface classes:

1. Providers of platform data
2. Providers of platform data transports

Examples of the first are applications that expose various platform sensors or
provide data about the firmware itself. Failure of the first class of
applications usually yields a system that can continue to operate in a reduced
capacity.

Examples of the second are the operating system itself and applications that
implement IPMI, HTTPS (e.g. for Redfish), MCTP and PLDM. This second class also
covers implementation-specific data transports such as D-Bus, which requires a
broker service. Failure of a platform data transport may result in one or all
external interfaces becoming unresponsive, and may be viewed as a critical
failure of the BMC.

Like error conditions and services, the BMC's external interfaces can be
divided into several classes:

1. Out-of-band interfaces: Remote, operator-driven platform management
2. In-band interfaces: Local, host-firmware-driven platform management

Failures of platform data transports generally leave out-of-band interfaces
unresponsive to the point that the BMC cannot be recovered except via external
means, usually by issuing a (disruptive) AC power cycle. On the other hand, if
the host can detect the BMC is unresponsive on the in-band interface(s), an
appropriate platform design can enable the host to reset the BMC without
disrupting its own operation.

### Analysis of eBMC Error State Management and Mitigation Mechanisms

Assessing OpenBMC userspace with respect to the error classes outlined above,
the system manages and mitigates error conditions as follows:

| Condition           | Mechanism                                           |
|---------------------|-----------------------------------------------------|
| Continued operation | Application-specific error handling                 |
| Graceful exit       | Application-specific error handling                 |
| Crash               | Signal, unhandled exceptions, `assert()`, `abort()` |
| Unresponsive        | None                                                |

These mechanisms inform systemd (the service manager) of an event, which it
handles according to the restart policy encoded in the unit file for the
service.
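
For illustration, a unit file might encode such a restart policy along the
following lines (the service name and values here are hypothetical):

```ini
[Unit]
Description=Example platform data provider

[Service]
ExecStart=/usr/bin/example-sensor-daemon
# Restart on crashes and non-zero exits. Note that an unresponsive
# service that never exits does not trip this policy.
Restart=on-failure
RestartSec=5
```

systemd does offer `WatchdogSec=` with an `sd_notify()` keep-alive to convert
unresponsiveness into a detectable failure, but that requires explicit support
in each application.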

Assessing the OpenBMC operating system with respect to the error classes, it
manages and mitigates error conditions as follows:

| Condition           | Mechanism                              |
|---------------------|----------------------------------------|
| Continued operation | ramoops, ftrace, `printk()`            |
| Graceful exit       | System reboot                          |
| Crash               | kdump or ramoops                       |
| Unresponsive        | `hardlockup_panic`, `softlockup_panic` |

Crash conditions in the Linux kernel trigger panics, which are handled by kdump
(though may be handled by ramoops until kdump support is integrated). Kernel
lockup conditions can be configured to trigger panics, which in turn trigger
either ramoops or kdump.
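
For reference, the lockup-to-panic behaviour can be enabled with kernel
sysctls such as the following (the values are illustrative):

```ini
# e.g. /etc/sysctl.d/80-lockup-panic.conf
# Panic when a hard or soft lockup is detected, so kdump/ramoops can
# capture state; then reboot automatically after 10 seconds.
kernel.hardlockup_panic = 1
kernel.softlockup_panic = 1
kernel.panic = 10
```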

### Synthesis

In the context of the information above, handling of application lock-up error
conditions is not provided. For applications in the platform-data-provider
class of external interfaces, the system will continue to operate with reduced
functionality. For applications in the platform-data-transport-provider class,
this represents a critical failure of the firmware that must have accompanying
debug data.

## Requirements

### Recovery Mechanisms

The ability for external consumers to control the recovery behaviour of BMC
services is usually coarse; nuanced handling is left to the BMC
implementation. Where available, the options for external consumers tend to
be, in ascending order of severity:

| Severity | BMC Recovery Mechanism  | Used for                                                              |
|----------|-------------------------|-----------------------------------------------------------------------|
| 1        | Graceful reboot request | Normal circumstances or recovery from platform data provider failures |
| 2        | Forceful reboot request | Recovery from unresponsive platform data transport providers          |
| 3        | External hardware reset | Unresponsive operating system                                         |

Of course, it's not possible to issue these requests over interfaces that are
unresponsive. A robust platform design should be capable of issuing all three
restart requests over separate interfaces to minimise the impact of any one
interface becoming unresponsive. Further, the more severe the reset type, the
fewer dependencies should be in its execution path.

Given the out-of-band path is often limited to just the network, it's not
feasible for the BMC to provide any of the above in the event of some kind of
network or relevant data transport failure. The considerations here are
therefore limited to recovery of unresponsive in-band interfaces.

The need to escalate above mechanism 1 should come with data that captures why
it was necessary, i.e. dumps for services that failed in the path for 1.
However, by escalating straight to 3, the BMC will necessarily miss out on
capturing a debug dump because there is no opportunity for software to
intervene in the reset. Therefore, mechanism 2 should exist in the system
design, and its implementation should capture any appropriate data needed to
debug both the need to reboot and the inability to execute approach 1.

The need to escalate to 3 would indicate that the BMC's own mechanisms for
detecting a kernel lockup have failed. Had they not failed, we would have
ramoops or kdump data to analyse. As data cannot be captured with an escalation
to mechanism 3, the need to invoke it will require its own specialised debug
experience. Given this, and the kernel's own lockup detection and data
collection mechanisms, support for 2 can be implemented in BMC userspace.

Mechanism 1 is typically initiated by the usual in-band interfaces, either IPMI
or PLDM. In order to avoid these in the implementation of mechanism 2, the host
needs an interface to the BMC that is dedicated to the role of BMC recovery,
with minimal dependencies on the BMC side for initiating the dump collection
and reboot. At its core, all that is needed is the ability to trigger a BMC
IRQ, which could be as simple as monitoring a GPIO.

### Behavioural Requirements for Recovery Mechanism 2

The system behaviour requirement for the mechanism is:

1. The BMC executes collection of debug data and then reboots once it observes
   a recovery message from the host

It's desirable that:

1. The host has some indication that the recovery process has been activated
2. The host has some indication that a BMC reset has taken place

It's necessary that:

1. The host make use of a timeout to escalate to recovery mechanism 3, as it's
   possible the BMC will be unresponsive to recovery mechanism 2

### Analysis of BMC Recovery Mechanisms for Power10 Platforms

The implementation of recovery mechanism 1 is already accounted for in the
in-band protocols between the host and the BMC, and so is considered resolved
for the purpose of this discussion.

To address recovery mechanism 3, the Power10 platform designs wire up a GPIO
driven by the host to the BMC's EXTRST pin. If the host firmware detects that
the BMC has become unresponsive to its escalating recovery requests, it can
drive the hardware to forcefully reset the BMC.

However, host-side GPIOs are in short supply, and we do not have a dedicated
pin to implement recovery mechanism 2 in the platform designs.

### Analysis of Implementation Methods on Power10 Platforms

The implementation of recovery mechanism 2 is limited to using existing
interfaces between the host and the BMC. These largely consist of:

1. FSI
2. LPC
3. PCIe

FSI is inappropriate because the host is the peripheral in its relationship
with the BMC. If the BMC has become unresponsive, it is possible it's in a
state where it would not accept FSI traffic (which it needs to drive in the
first place), and we would need a mechanism architected into FSI for the BMC
to recognise it is in a bad state. PCIe and LPC are preferable by comparison,
as the BMC is the peripheral in this relationship, with the host driving
cycles into it over either interface. Comparatively, PCIe is more complex than
LPC, so an LPC-based approach is preferred.

The host already makes use of several LPC peripherals exposed from the BMC:

1. Mapped LPC FW cycles
2. iBT for IPMI
3. The VUARTs for system and debug consoles
4. A KCS device for a vendor-defined MCTP LPC binding

The host could take advantage of any of the following LPC peripherals for
implementing recovery mechanism 2:

1. The SuperIO-based iLPC2AHB bridge
2. The LPC mailbox
3. An otherwise unused KCS device

In ASPEED BMC SoCs prior to the AST2600, the LPC mailbox required
configuration via the SuperIO device, which exposes the unrestricted iLPC2AHB
backdoor into the BMC's physical address space. The iLPC2AHB capability could
not be mitigated without disabling SuperIO support entirely, and so the
ability to use the mailbox went with it. This security issue is resolved in
the AST2600 design, so the mailbox could be used in the Power10 platforms, but
we have lower-complexity alternatives for generating an IRQ on the BMC. We
could use the iLPC2AHB from the host to drive one of the watchdogs in the BMC
to trigger a reset, but this exposes a stability risk due to the unrestricted
power of the interface, let alone the security implications, and like the
mailbox it is more complex than the alternatives.

This draws us towards the use of a KCS device, which is best aligned with the
simple need of generating an IRQ on the BMC. The AST2600 has at least four KCS
devices, of which one is already in use for IBM's vendor-defined MCTP LPC
binding, leaving at least three from which to choose.

## Proposed Design

The proposed design is for a simple daemon, started at BMC boot, to invoke the
desired crash dump handler according to the system policy upon receiving the
external signal. The implementation should have no IPC dependencies or
interactions with `init`, as the reason for invoking the recovery mechanism is
unknown and any of these interfaces might be unresponsive.

A trivial implementation of the daemon is:

```sh
# Block until the host writes a byte to the device at $path, then
# trigger a kernel crash dump via the sysrq interface.
dd if="$path" bs=1 count=1
echo c > /proc/sysrq-trigger
```

For systems with kdump enabled, this will result in a kernel crash dump being
collected and the BMC being rebooted.

A more elegant implementation might be to invoke `kexec` directly, but this
requires that kexec support is already available on the platform.

Other activities in userspace might be feasible if it can be assumed that
whatever failure has occurred will not prevent debug data collection, but no
statement about this can be made in general.

### An Idealised KCS-based Protocol for Power10 Platforms

The proposed implementation provides for both the required and desired
behaviours outlined in the requirements section above.

The host and BMC protocol operates as follows, starting with the BMC
application invoked during the boot process:

1. Set the `Ready` bit in STR

2. Wait for an `IBF` interrupt

3. Read `IDR`. The hardware clears `IBF` as a result

4. If the read value is 0x44 (`D` for "Debug"), then execute the debug dump
   collection process and reboot. Otherwise,

5. Go to step 2.
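
The BMC-side steps above can be sketched as a simple loop. The following is an
illustrative simulation only: the `KcsDevice` class is a stand-in for the real
hardware interface, which would deliver the `IBF` interrupt and register
access.

```python
DEBUG_MAGIC = 0x44   # 'D' for "Debug"
STR_READY = 1 << 4   # software-defined Ready bit in STR

class KcsDevice:
    """Stand-in for the BMC's KCS hardware (hypothetical interface)."""

    def __init__(self, host_writes):
        self.str = 0
        self._writes = iter(host_writes)

    def wait_ibf_and_read_idr(self):
        # Steps 2-3: block until the host writes IDR (raising IBF), then
        # read IDR, which clears IBF. Simulated by popping the next write.
        return next(self._writes)

def bmc_recovery_loop(dev, collect_dump_and_reboot):
    dev.str |= STR_READY  # Step 1: advertise readiness to the host
    while True:
        value = dev.wait_ibf_and_read_idr()
        if value == DEBUG_MAGIC:  # Step 4
            collect_dump_and_reboot()
            return
        # Step 5: ignore anything else and go back to waiting
```

Any write other than the magic value is ignored, which keeps stray traffic on
the device from triggering a dump and reboot.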

On the host:

1. If the `Ready` bit in STR is clear, escalate to recovery mechanism 3.
   Otherwise,

2. If the `IBF` bit in STR is set, escalate to recovery mechanism 3. Otherwise,

3. Start an escalation timer

4. Write 0x44 (`D` for "Debug") to the Input Data Register (IDR). The hardware
   sets `IBF` as a result

5. If `IBF` clears before expiry, restart the escalation timer

6. If an STR read generates an LPC SYNC No Response abort, or `Ready` clears
   before expiry, restart the escalation timer

7. If `Ready` becomes set before expiry, disarm the escalation timer. Recovery
   is complete. Otherwise,

8. Escalate to recovery mechanism 3 if the escalation timer expires at any
   point
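
The host-side steps can likewise be sketched as a polling state machine. The
snippet below is an illustrative simulation, not host firmware: the function
names are assumptions, `poll_str` returns the current STR value (or `None` to
model an LPC SYNC No Response abort), the timer is modelled as a poll count,
and the `IBF`-clear case of step 5 is elided for brevity.

```python
DEBUG_MAGIC = 0x44
STR_READY = 1 << 4
STR_IBF = 1 << 1

def host_recover(poll_str, write_idr, max_ticks):
    status = poll_str()
    # Steps 1-2: bail out to mechanism 3 unless the BMC is ready and
    # able to accept a write.
    if status is None or not status & STR_READY or status & STR_IBF:
        return "escalate"
    ticks = 0                # Step 3: escalation timer
    write_idr(DEBUG_MAGIC)   # Step 4: hardware sets IBF on the BMC side
    saw_reset = False
    while ticks < max_ticks:  # Step 8: expiry escalates to mechanism 3
        status = poll_str()
        ticks += 1
        if status is None or not status & STR_READY:
            # Step 6: an abort or Ready dropping means the reboot is in
            # progress; restart the timer and keep polling.
            saw_reset = True
            ticks = 0
        elif saw_reset:
            return "recovered"  # Step 7: Ready re-asserted after reset
    return "escalate"
```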

A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
implementation is not required to emit one, and the host implementation must
behave correctly without one. Recovery is only necessary if other paths have
failed, so STR can be read by the host when it decides recovery is required,
and by time-based polling thereafter.

The host must be prepared to handle LPC SYNC errors when accessing the KCS
device IO addresses, particularly "No Response" aborts. It is not guaranteed
that the KCS device will remain available during BMC resets.

As STR is polled by the host, it's not necessary for the BMC to write to ODR.
The protocol only requires the host to write to IDR and periodically poll STR
for changes to the `IBF` and `Ready` state. This removes bi-directional
dependencies.

The uni-directional writes and the lack of SerIRQ reduce the features required
for correct operation of the protocol, and thus the surface area for failure
of the recovery protocol.

The layout of the KCS Status Register (STR) is as follows:

| Bit | Owner    | Definition               |
|-----|----------|--------------------------|
| 7   | Software |                          |
| 6   | Software |                          |
| 5   | Software |                          |
| 4   | Software | Ready                    |
| 3   | Hardware | Command / Data           |
| 2   | Software |                          |
| 1   | Hardware | Input Buffer Full (IBF)  |
| 0   | Hardware | Output Buffer Full (OBF) |

### A Real-World Implementation of the KCS Protocol for Power10 Platforms

Implementing the protocol described above in userspace is challenging due to
the available kernel interfaces [1], and implementing the behaviour in the
kernel falls afoul of the de facto "mechanism, not policy" rule of kernel
support.

Realistically, on the host side the only requirements are the use of a timer
and writing the appropriate value to the Input Data Register (IDR). All the
proposed status bits can be ignored. With this in mind, the BMC's
implementation can be reduced to reading an appropriate value from IDR.
Reducing the requirements on the BMC's behaviour in this way allows the use of
the `serio_raw` driver (which has the restriction that userspace can't access
the status value).

[1] https://lore.kernel.org/lkml/37e75b07-a5c6-422f-84b3-54f2bea0b917@www.fastmail.com/
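
Under this reduced scheme, the BMC-side daemon collapses to reading bytes from
the `serio_raw` character device until the magic value arrives. A minimal
sketch follows; the device path and the factoring into a testable helper are
assumptions, not details of the prototype.

```python
DEBUG_MAGIC = b"D"  # 0x44, as written by the host to IDR

def wait_for_trigger(dev, magic=DEBUG_MAGIC):
    """Consume the raw byte stream until the debug magic is seen.

    With serio_raw the status register is not visible to userspace, so
    the data value is the only thing that can be checked.
    """
    while True:
        byte = dev.read(1)
        if not byte:
            return False  # stream ended: the device went away
        if byte == magic:
            return True

def main(path="/dev/serio_raw0"):
    with open(path, "rb", buffering=0) as dev:
        if wait_for_trigger(dev):
            # Hand over to the crash path, e.g. sysrq as in the trivial
            # daemon shown earlier.
            with open("/proc/sysrq-trigger", "w") as f:
                f.write("c")
```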

### Prototype Implementation Supporting Power10 Platforms

A concrete implementation of the proposal's userspace daemon is available on
GitHub:

https://github.com/amboar/debug-trigger/

Deployment requires additional kernel support in the form of patches at [2].

[2] https://github.com/amboar/linux/compare/2dbb5aeba6e55e2a97e150f8371ffc1cc4d18180...for/openbmc/kcs-raw

## Alternatives Considered

See the discussion in Background.

## Impacts

The proposal has some security implications. The mechanism provides an
unauthenticated means for the host firmware to crash and/or reboot the BMC,
which can itself become a concern for stability and availability. Use of this
feature requires that the host firmware is trusted, that is, that the host and
BMC firmware are in the same trust domain. If a platform concept requires that
the BMC and host firmware remain in disjoint trust domains, then this feature
must not be provided by the BMC.

As the feature might provide surprising system behaviour, there is an impact
on documentation for systems deploying this design: the mechanism must be
documented in such a way that rebooting the BMC in these circumstances isn't
surprising.

Developers are impacted in the sense that they may have access to better debug
data than might otherwise be possible. There are no obvious developer-specific
drawbacks.

Due to simplicity being a design point of the proposal, there are no
significant API, performance or upgradability impacts.

## Testing

Generally, testing this feature requires complex interactions with host
firmware and platform-specific mechanisms for triggering the reboot
behaviour.

For Power10 platforms this feature may be safely tested under QEMU by
scripting the monitor to inject values on the appropriate KCS device.
Implementing this for automated testing may need explicit support in CI.