docs/designs/bmc-service-failure-debug-and-recovery.md

1 # BMC Service Failure Debug and Recovery
11 The capability to debug critical failures of the BMC firmware is essential to
16 - A class of failure exists where a BMC service has entered a failed state but
17   the BMC is still operational in a degraded mode.
19   despite being unable to communicate with the BMC via standard protocols.
22 recovery of a failed BMC.
32 As such, failures of BMC subsystems may impact external consumers.
34 The BMC firmware stack is not trivial, in the sense that common implementations
39 the BMC firmware. The BMC firmware design should provide for resilience and
76 Like error conditions, services exposed by the BMC can be divided into several
92 the BMC.
94 Like error conditions and services, the BMC's external interfaces can be divided
101 unresponsive to the point that the BMC cannot be recovered except via external
103 the host can detect the BMC is unresponsive on the in-band interface(s), an
104 appropriate platform design can enable the host to reset the BMC without
126 to be in a failed state by systemd and not restarted again until a BMC reboot.
158 The ability for external consumers to control the recovery behaviour of BMC
159 services is usually coarse; the nuanced handling is left to the BMC
163 | Severity | BMC Recovery Mechanism  | Used for                                                    …
176 feasible for the BMC to provide any of the above in the event of some kind of
182 However, by escalating straight to 3, the BMC will necessarily miss out on
188 The need to escalate to 3 would indicate that the BMC's own mechanisms for
193 mechanism, support for 2 can be implemented in BMC userspace.
197 needs an interface to the BMC that is dedicated to the role of BMC recovery,
198 with minimal dependencies on the BMC side for initiating the dump collection and
199 reboot. At its core, all that is needed is the ability to trigger a BMC IRQ,
206 1. The BMC executes collection of debug data and then reboots once it observes a
212 2. The host has some indication that a BMC reset has taken place
217    possible the BMC will be unresponsive to recovery mechanism 2
219 #### Analysis of BMC Recovery Mechanisms for Power10 Platforms
222 in-band protocols between the host and the BMC and so is considered resolved for
226 driven by the host to the BMC's EXTRST pin. If the host firmware detects that
227 the BMC has become unresponsive to its escalating recovery requests, it can
228 drive the hardware to forcefully reset the BMC.
236 interfaces between the host and the BMC. These largely consist of:
243 the BMC. If the BMC has become unresponsive, it is possible it's in a state
245 place) and we would need a mechanism architected into FSI for the BMC to
247 BMC is the peripheral in this relationship, with the host driving cycles into it
251 The host already makes use of several LPC peripherals exposed from the BMC:
265 In ASPEED BMC SoCs prior to the AST2600 the LPC mailbox required configuration
267 the BMC's physical address space. The iLPC2AHB capability could not be mitigated
271 alternatives for generating an IRQ on the BMC. We could use the iLPC2AHB from
272 the host to drive one of the watchdogs in the BMC to trigger a reset, but this
278 simple need of generating an IRQ on the BMC. AST2600 has at least 4 KCS devices
284 The proposed design is for a simple daemon started at BMC boot to invoke the
298 collection and the BMC being rebooted.
312 The host and BMC protocol operates as follows, starting with the BMC application
348 A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
356 that the KCS device will remain available during BMC resets.
358 As STR is polled by the host it's not necessary for the BMC to write to ODR. The
387 status bits can be ignored. With this in mind, the BMC's implementation can be
389 BMC's behaviour in this way allows the use of the `serio_raw` driver (which has
414 unauthenticated means for the host firmware to crash and/or reboot the BMC,
417 BMC firmware must be in the same trust domain. If a platform concept requires
418 that the BMC and host firmware remain in disjoint trust domains then this
419 feature must not be provided by the BMC.
423 documented in such a way that rebooting the BMC in these circumstances isn't
452 failure state are that the BMC:
455 - Collects a BMC dump
456 - Changes BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
457 - Allows system owners to customize other behaviors (e.g. BMC reboot)
465 Define an "obmc-bmc-service-quiesce.target". System owners can install any other
468 phosphor-bmc-state-manager will monitor this target and enter a `Quiesced` state
470 under redfish/v1/Managers/bmc status property.
479   - Request a BMC dump
480   - Start obmc-bmc-service-quiesce.target
481 - BMC state manager detects obmc-bmc-service-quiesce.target has started and puts
482   the BMC state into Quiesced
483 - bmcweb looks at BMC state to return appropriate state to external clients
490 One simpler option would be to just have the OnFailure result in a BMC reboot
493 - Rarely does a BMC reboot fix a service that was not fixed by simply restarting
495 - A BMC that continuously reboots itself due to a service failure is very
497 - Some BMCs only allow a certain number of reboots so eventually the BMC ends
505 and the external BMC state reflects the failure when this occurs.
510 ensure the appropriate error is logged, dump is collected, and BMC state is