
# BMC Service Failure Debug and Recovery
The capability to debug critical failures of the BMC firmware is essential to
- A class of failure exists where a BMC service has entered a failed state but
  the BMC is still operational in a degraded mode.
- A class of failure exists under which we can attempt debug data collection
  despite being unable to communicate with the BMC via standard protocols.
This proposal motivates and describes a software-driven mechanism for debug
data capture and recovery of a failed BMC.
By necessity, BMCs are not self-contained systems. BMCs exist to service the
needs of both the host system by providing in-band platform services such as
out-of-band system management interfaces such as error reporting, platform
As such, failures of BMC subsystems may impact external consumers.
The BMC firmware stack is not trivial, in the sense that common implementations
are usually domain-specific Linux distributions with complex or highly coupled
the BMC firmware. The BMC firmware design should provide for resilience and
recovery in the face of well-defined error conditions, but the need to mitigate
ill-defined error conditions and unintended software states remains.
As the state transformations to enter the ill-defined or unintended software

Like error conditions, services exposed by the BMC can be divided into several
covers implementation-specific data transports such as D-Bus, which requires a
the BMC.
Like error conditions and services, the BMC's external interfaces can be divided

1. Out-of-band interfaces: Remote, operator-driven platform management
2. In-band interfaces: Local, host-firmware-driven platform management
Failures of platform data transports generally leave out-of-band interfaces
unresponsive to the point that the BMC cannot be recovered except via external
the host can detect the BMC is unresponsive on the in-band interface(s), an
appropriate platform design can enable the host to reset the BMC without
| ------------------- | --------------------------------------------------- |
| Continued operation | Application-specific error handling                 |
| Graceful exit       | Application-specific error handling                 |
to be in a failed state by systemd and not restarted again until a BMC reboot.

| ------------------- | -------------------------------------- |
lockup conditions can be configured to trigger panics, which in turn trigger
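On Linux-based BMC firmware, panic-on-lockup behaviour of this kind is usually
enabled through kernel sysctls. The following is a hedged sketch of such a
configuration fragment; the file path and timeout value are assumptions, not
part of this design:

```
# Hypothetical sysctl fragment, e.g. /etc/sysctl.d/10-lockup-panic.conf
kernel.softlockup_panic = 1   # panic when the soft-lockup detector fires
kernel.hung_task_panic = 1    # panic when a task is blocked for too long
kernel.panic = 10             # reboot 10 seconds after a panic
```

If a dump mechanism such as kdump or pstore/ramoops is also configured, the
panic path then doubles as a debug data collection trigger.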
In the context of the information above, handling of application lockup error
conditions is not provided. For applications in the platform-data-provider class
functionality. For applications in the platform-data-transport-provider class,

## Handling platform-data-transport-provider failures
The ability for external consumers to control the recovery behaviour of BMC
services is usually coarse; the nuanced handling is left to the BMC
| Severity | BMC Recovery Mechanism | Used for …
| -------- | ----------------------- | ------------------------------------------------------------…
Given the out-of-band path is often limited to just the network, it's not
feasible for the BMC to provide any of the above in the event of some kind of
therefore limited to recovery of unresponsive in-band interfaces.
However, by escalating straight to 3, the BMC will necessarily miss out on
The need to escalate to 3 would indicate that the BMC's own mechanisms for
mechanism, support for 2 can be implemented in BMC userspace.
Mechanism 1 is typically initiated by the usual in-band interfaces, either IPMI
needs an interface to the BMC that is dedicated to the role of BMC recovery,
with minimal dependencies on the BMC side for initiating the dump collection and
reboot. At its core, all that is needed is the ability to trigger a BMC IRQ,
1. The BMC executes collection of debug data and then reboots once it observes a
2. The host has some indication that a BMC reset has taken place
possible the BMC will be unresponsive to recovery mechanism 2

#### Analysis of BMC Recovery Mechanisms for Power10 Platforms
in-band protocols between the host and the BMC and so is considered resolved for
driven by the host to the BMC's EXTRST pin. If the host firmware detects that
the BMC has become unresponsive to its escalating recovery requests, it can
drive the hardware to forcefully reset the BMC.
However, host-side GPIOs are in short supply, and we do not have a dedicated pin
interfaces between the host and the BMC. These largely consist of:
the BMC. If the BMC has become unresponsive, it is possible it's in a state
place) and we would need a mechanism architected into FSI for the BMC to
BMC is the peripheral in this relationship, with the host driving cycles into it
LPC-based approach is preferred.
The host already makes use of several LPC peripherals exposed from the BMC:

4. A KCS device for a vendor-defined MCTP LPC binding
1. The SuperIO-based iLPC2AHB bridge
3. An otherwise unused KCS device
In ASPEED BMC SoCs prior to the AST2600, the LPC mailbox required configuration
the BMC's physical address space. The iLPC2AHB capability could not be mitigated
mailbox went with it. This security issue is resolved in the AST2600 design, so
the mailbox could be used in the Power10 platforms, but we have lower-complexity
alternatives for generating an IRQ on the BMC. We could use the iLPC2AHB from
the host to drive one of the watchdogs in the BMC to trigger a reset, but this
This draws us towards the use of a KCS device, which is best aligned with the
simple need of generating an IRQ on the BMC. The AST2600 has at least 4 KCS
devices, of which one is already in use for IBM's vendor-defined MCTP LPC
binding, leaving
The proposed design is for a simple daemon started at BMC boot to invoke the

echo c > /proc/sysrq-trigger

collection and the BMC being rebooted.
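The daemon's core behaviour can be sketched in shell. This is a minimal
illustration only, not the referenced implementation: the device path is an
assumption, and a real daemon would be supervised and would loop rather than
run once.

```sh
#!/bin/sh
# Sketch of the BMC-side trigger daemon (illustrative; device name assumed).
debug_trigger() {
    dev="$1"    # KCS data device, e.g. /dev/serio_raw0 (hypothetical path)
    sysrq="$2"  # normally /proc/sysrq-trigger
    # Block until the host writes a byte to the KCS input data register;
    # any value is treated as the "collect debug data and reboot" signal.
    head -c1 "$dev" > /dev/null || return 1
    # Ask the kernel to crash, producing dump collection and a reboot.
    printf 'c' > "$sysrq"
}
```

Keeping the daemon this small minimises its dependencies on the rest of the
(possibly degraded) userspace, which is the point of the design.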
#### An Idealised KCS-based Protocol for Power10 Platforms
The host and BMC protocol operates as follows, starting with the BMC application
A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
by read by time-based polling thereafter.
The host must be prepared to handle LPC SYNC errors when accessing the KCS
that the KCS device will remain available during BMC resets.
As STR is polled by the host, it's not necessary for the BMC to write to ODR.
The changes to IBF and Ready state. This removes bi-directional dependencies.
The uni-directional writes and the lack of SerIRQ reduce the features required
The layout of the KCS Status Register (STR) is as follows:

| --- | -------- | ------------------------ |
#### A Real-World Implementation of the KCS Protocol for Power10 Platforms
status bits can be ignored. With this in mind, the BMC's implementation can be
BMC's behaviour in this way allows the use of the `serio_raw` driver (which has
https://lore.kernel.org/lkml/37e75b07-a5c6-422f-84b3-54f2bea0b917@www.fastmail.com/
https://github.com/amboar/debug-trigger/
https://github.com/amboar/linux/compare/2dbb5aeba6e55e2a97e150f8371ffc1cc4d18180...for/openbmc/kcs-
unauthenticated means for the host firmware to crash and/or reboot the BMC,
BMC firmware must be in the same trust domain. If a platform concept requires
that the BMC and host firmware remain in disjoint trust domains, then this
feature must not be provided by the BMC.
documented in such a way that rebooting the BMC in these circumstances isn't
data than might otherwise be possible. There are no obvious developer-specific
Due to simplicity being a design point of the proposal, there are no significant
and platform-specific mechanisms for triggering the reboot behaviour.
the monitor to inject values on the appropriate KCS device. Implementing this

## Handling platform-data-provider failures
The requirements for OpenBMC when a platform-data-provider service enters a
failure state are that the BMC:

- Logs an error indicating a service has failed
- Collects a BMC dump
- Changes BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
- Allows system owners to customize other behaviors (e.g. BMC reboot)
This will build upon the existing [target-fail-monitoring][1] design. The
Define an "obmc-bmc-service-quiesce.target". System owners can install any other
phosphor-bmc-state-manager will monitor this target and enter a `Quiesced` state
under the redfish/v1/Managers/bmc status property.
- In a services-to-monitor configuration file, add all critical services
- The state-manager service-monitor will subscribe to signals for service
  - Log error with service failure information
  - Request a BMC dump
  - Start obmc-bmc-service-quiesce.target
- BMC state manager detects obmc-bmc-service-quiesce.target has started and puts
  the BMC state into Quiesced
- bmcweb looks at BMC state to return appropriate state to external clients
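One way to wire a critical service into the quiesce target is a systemd
`OnFailure=` drop-in, sketched below. The service name and file path are
hypothetical; note that in the flow above the service-monitor starts the target
programmatically, after logging the error and requesting the dump:

```
# Hypothetical drop-in:
# /etc/systemd/system/critical-example.service.d/10-quiesce.conf
[Unit]
OnFailure=obmc-bmc-service-quiesce.target
```

Because the target is a first-class systemd unit, system owners can also
install their own units wanting it to layer on custom behaviour.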
[1]: https://github.com/openbmc/docs/blob/master/designs/target-fail-monitoring.md
One simpler option would be to have the OnFailure directive trigger a BMC reboot
- Rarely does a BMC reboot fix a service that was not fixed by simply restarting
- A BMC that continuously reboots itself due to a service failure is very
- Some BMCs only allow a certain number of reboots, so eventually the BMC ends
and the external BMC state reflects the failure when this occurs.
ensure the appropriate error is logged, a dump is collected, and the BMC state is