docs/designs/bmc-service-failure-debug-and-recovery.md

16 - A class of failure exists where a BMC service has entered a failed state but
18 - A class of failure exists under which we can attempt debug data collection
21 This proposal argues for and proposes a software-driven debug data capture and
26 By necessity, BMCs are not self-contained systems. BMCs exist to service the
27 needs of both the host system by providing in-band platform services such as
29 out-of-band system management interfaces such as error reporting, platform
35 are usually domain-specific Linux distributions with complex or highly coupled
40 recovery in the face of well-defined error conditions, but the need to mitigate
41 ill-defined error conditions or entering unintended software states remains.
60 As the state transformations to enter the ill-defined or unintended software
89 covers implementation-specific data transports such as D-Bus, which requires a
97 1. Out-of-band interfaces: Remote, operator-driven platform management
98 2. In-band interfaces: Local, host-firmware-driven platform management
100 Failures of platform data transports generally leave out-of-band interfaces
103 the host can detect the BMC is unresponsive on the in-band interface(s), an
113 | ------------------- | --------------------------------------------------- |
114 | Continued operation | Application-specific error handling                 |
115 | Graceful exit       | Application-specific error handling                 |
132 | ------------------- | -------------------------------------- |
140 lockup conditions can be configured to trigger panics, which in turn trigger
145 In the context of the information above, handling of application lock-up error
146 conditions is not provided. For applications in the platform-data-provider class
148 functionality. For applications in the platform-data-transport-provider class,
152 ## Handling platform-data-transport-provider failures
164 | -------- | ----------------------- | ------------------------------------------------------------…
167 | 3        | External hardware reset | Unresponsive operating system                               …
175 Given the out-of-band path is often limited to just the network, it's not
178 therefore limited to recovery of unresponsive in-band interfaces.
195 Mechanism 1 is typically initiated by the usual in-band interfaces, either IPMI
222 in-band protocols between the host and the BMC and so is considered resolved for
226 driven by the host to the BMC's EXTRST pin. If the host firmware detects that
228 drive the hardware to forcefully reset the BMC.
230 However, host-side GPIOs are in short supply, and we do not have a dedicated pin
249 LPC-based approach is preferred.
256 4. A KCS device for a vendor-defined MCTP LPC binding
261 1. The SuperIO-based iLPC2AHB bridge
270 the mailbox could be used in the Power10 platforms, but we have lower-complexity
279 of which one is already in use for IBM's vendor-defined MCTP LPC binding leaving
294 echo c > /proc/sysrq-trigger
307 #### An Idealised KCS-based Protocol for Power10 Platforms
319 3. Read `IDR`. The hardware clears IBF as a result
335 4. Write 0x44 (`D` for "Debug") to the Input Data Register (IDR). The hardware
348 A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
352 be read by time-based polling thereafter.
360 changes to IBF and Ready state. This removes bi-directional dependencies.
362 The uni-directional writes and the lack of SerIRQ reduce the features required
369 | --- | -------- | ------------------------ |
374 | 3   | Hardware | Command / Data           |
376 | 1   | Hardware | Input Buffer Full (IBF)  |
377 | 0   | Hardware | Output Buffer Full (OBF) |
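The hardware-owned bits in the table above can be decoded with a few masks. The following is a minimal sketch, not the implementation used on Power10 platforms: the function name `decode_status` and the dictionary keys are hypothetical, and only the three bit positions shown in this excerpt (3, 1 and 0) and the `0x44` trigger value are taken from the text.

```python
# Bit positions as listed in the status register table above.
STATUS_CMD_DATA = 1 << 3  # Command / Data
STATUS_IBF = 1 << 1       # Input Buffer Full (IBF)
STATUS_OBF = 1 << 0       # Output Buffer Full (OBF)

# Value the protocol writes to IDR: 0x44, ASCII "D" for "Debug".
DEBUG_TRIGGER = 0x44


def decode_status(status: int) -> dict:
    """Return the hardware-owned flags of a KCS status byte.

    Hypothetical helper for illustration; bits not covered by the
    excerpted table rows are ignored.
    """
    return {
        "cmd_data": bool(status & STATUS_CMD_DATA),
        "ibf": bool(status & STATUS_IBF),
        "obf": bool(status & STATUS_OBF),
    }


if __name__ == "__main__":
    # Example: a status byte with Command/Data and IBF set, OBF clear.
    print(decode_status(0b0000_1010))
    print(chr(DEBUG_TRIGGER))
```

A poller on the BMC side would, under these assumptions, wait for `ibf` to become true before reading `IDR`, since the read is what clears IBF.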
379 #### A Real-World Implementation of the KCS Protocol for Power10 Platforms
393   https://lore.kernel.org/lkml/37e75b07-a5c6-422f-84b3-54f2bea0b917@www.fastmail.com/
400 <https://github.com/amboar/debug-trigger/>
405 …ps://github.com/amboar/linux/compare/2dbb5aeba6e55e2a97e150f8371ffc1cc4d18180...for/openbmc/kcs-raw
427 data than might otherwise be possible. There are no obvious developer-specific
430 Due to simplicity being a design-point of the proposal, there are no significant
436 and platform-specific mechanisms for triggering the reboot behaviour.
442 ## Handling platform-data-provider failures
451 The requirements for OpenBMC when a platform-data-provider service enters a
454 - Logs an error indicating a service has failed
455 - Collects a BMC dump
456 - Changes BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
457 - Allows system owners to customize other behaviors (e.g. BMC reboot)
461 This will build upon the existing [target-fail-monitoring][3] design. The
465 Define an "obmc-bmc-service-quiesce.target". System owners can install any other
468 phosphor-bmc-state-manager will monitor this target and enter a `Quiesced` state
474 - In a services-to-monitor configuration file, add all critical services
475 - The state-manager service-monitor will subscribe to signals for service
478   - Log error with service failure information
479   - Request a BMC dump
480   - Start obmc-bmc-service-quiesce.target
481 - BMC state manager detects obmc-bmc-service-quiesce.target has started and puts
483 - bmcweb looks at BMC state to return appropriate state to external clients
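The flow in the list above can be modelled in a few lines. This is a toy in-memory sketch only, not phosphor-state-manager code: the `BmcModel` class, its method names, the action strings, and the initial `Ready` state are all assumptions made for illustration. Only the target name and the `Quiesced` state come from the text.

```python
from dataclasses import dataclass, field

QUIESCE_TARGET = "obmc-bmc-service-quiesce.target"


@dataclass
class BmcModel:
    """Hypothetical stand-in for the service monitor and state manager."""

    monitored: set                      # services listed in the config file
    state: str = "Ready"                # assumed initial BMC state
    actions: list = field(default_factory=list)
    started_targets: set = field(default_factory=set)

    def on_service_failed(self, service: str) -> None:
        """React to a service-failure signal for a monitored service."""
        if service not in self.monitored:
            return  # not a critical service, nothing to do
        self.actions.append(f"log-error:{service}")
        self.actions.append("request-dump")
        self.start_target(QUIESCE_TARGET)

    def start_target(self, target: str) -> None:
        """Start a target; the state manager reacts to the quiesce target."""
        self.started_targets.add(target)
        if target == QUIESCE_TARGET:
            self.state = "Quiesced"


if __name__ == "__main__":
    bmc = BmcModel(monitored={"example-critical.service"})
    bmc.on_service_failed("example-critical.service")
    print(bmc.state, bmc.actions)
```

In this model an external client such as bmcweb would simply read `state`; keeping the dump request and error log ahead of the target start mirrors the ordering of the bullets above.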
486   https://github.com/openbmc/docs/blob/master/designs/target-fail-monitoring.md
493 - Rarely does a BMC reboot fix a service that was not fixed by simply restarting
495 - A BMC that continuously reboots itself due to a service failure is very
497 - Some BMCs only allow a certain number of reboots, so eventually the BMC ends