# BMC Service Failure Debug and Recovery

Author: Andrew Jeffery <andrew@aj.id.au> @arj

Other contributors: Andrew Geissler <geissonator@yahoo.com> @geissonator

Created: 6th May 2021

## Problem Description

The capability to debug critical failures of the BMC firmware is essential to
meet the reliability and serviceability claims made for some platforms.

This design addresses a few classes of failures:

- A class of failure exists where a BMC service has entered a failed state but
  the BMC is still operational in a degraded mode.
- A class of failure exists under which we can attempt debug data collection
  despite being unable to communicate with the BMC via standard protocols.

This proposal argues for and proposes a software-driven debug data capture and
recovery of a failed BMC.

## Background and References

By necessity, BMCs are not self-contained systems. BMCs exist to service the
needs of both the host system, by providing in-band platform services such as
thermal and power management, and of system operators, by providing
out-of-band system management interfaces such as error reporting, platform
telemetry and firmware management.

As such, failures of BMC subsystems may impact external consumers.

The BMC firmware stack is not trivial, in the sense that common implementations
are usually domain-specific Linux distributions with complex or highly coupled
relationships to platform subsystems.

Complexity and coupling drive concern around the risk of critical failures in
the BMC firmware. The BMC firmware design should provide for resilience and
recovery in the face of well-defined error conditions, but the need to mitigate
ill-defined error conditions and unintended software states remains.

The ability for a system to recover in the face of an error condition depends
on its ability to detect the failure. Thus, error conditions can be assigned to
various classes based on the ability to externally observe the error:

1. Continued operation: The service detects the error and performs the actions
   required to return to its operating state

2. Graceful exit: The service detects an error it cannot recover from, but
   gracefully cleans up its resources before exiting with an appropriate exit
   status

3. Crash: The service detects it is in an unintended software state and exits
   immediately, failing to gracefully clean up its resources before exiting

4. Unresponsive: The service fails to detect it cannot make progress and
   continues to run but is unresponsive

As the state transformations to enter the ill-defined or unintended software
state are unanticipated, the actions required to gracefully return to an
expected state are also not well defined. The general approaches to recover a
system or service to a known state after entering an unknown state are:

1. Restart the affected service
2. Restart the affected set of services
3. Restart all services

In the face of continued operation due to internal recovery a service restart
is unnecessary, while in the case of an unresponsive service the need to
restart cannot be detected from service state alone. Implementation of
resiliency by way of service restarts via a service manager is only possible in
the face of a graceful exit or application crash. Handling of services that
have entered an unresponsive state can only begin upon receiving external
input.

Like error conditions, services exposed by the BMC can be divided into several
external interface classes:

1. Providers of platform data
2. Providers of platform data transports

Examples of the first are applications that expose various platform sensors or
provide data about the firmware itself. Failure of the first class of
applications usually yields a system that can continue to operate in a reduced
capacity.

Examples of the second are the operating system itself and applications that
implement IPMI, HTTPS (e.g. for Redfish), MCTP and PLDM. This second class also
covers implementation-specific data transports such as D-Bus, which requires a
broker service. Failure of a platform data transport may result in one or all
external interfaces becoming unresponsive and may be viewed as a critical
failure of the BMC.

Like error conditions and services, the BMC's external interfaces can be
divided into several classes:

1. Out-of-band interfaces: Remote, operator-driven platform management
2. In-band interfaces: Local, host-firmware-driven platform management

Failures of platform data transports generally leave out-of-band interfaces
unresponsive to the point that the BMC cannot be recovered except via external
means, usually by issuing a (disruptive) AC power cycle. On the other hand, if
the host can detect that the BMC is unresponsive on the in-band interface(s),
an appropriate platform design can enable the host to reset the BMC without
disrupting its own operation.

### Analysis of eBMC Error State Management and Mitigation Mechanisms

Assessing OpenBMC userspace with respect to the error classes outlined above,
the system manages and mitigates error conditions as follows:

| Condition           | Mechanism                                           |
| ------------------- | --------------------------------------------------- |
| Continued operation | Application-specific error handling                 |
| Graceful exit       | Application-specific error handling                 |
| Crash               | Signal, unhandled exceptions, `assert()`, `abort()` |
| Unresponsive        | None                                                |

These mechanisms inform systemd (the service manager) of an event, which it
handles according to the restart policy encoded in the unit file for the
service.

OpenBMC has a default restart behavior for all systemd services: a service is
allowed to restart twice every 30 seconds. If a service restarts more than
twice within 30 seconds, systemd considers it to be in a failed state and does
not restart it again until the BMC reboots.
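In systemd terms such a policy is expressed through start-limit directives. The
following is a minimal sketch only, using a hypothetical `example.service`
drop-in; the authoritative values come from the OpenBMC distribution's systemd
configuration rather than from this design:

```sh
# Illustrative drop-in for a hypothetical example.service expressing a
# "twice per 30 seconds" style restart policy; the values shown are examples,
# not the authoritative OpenBMC defaults.
mkdir -p /etc/systemd/system/example.service.d
cat > /etc/systemd/system/example.service.d/restart-policy.conf <<'EOF'
[Unit]
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
Restart=on-failure
EOF
systemctl daemon-reload
```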
Assessing the OpenBMC operating system with respect to the error classes, it
manages and mitigates error conditions as follows:

| Condition           | Mechanism                              |
| ------------------- | -------------------------------------- |
| Continued operation | ramoops, ftrace, `printk()`            |
| Graceful exit       | System reboot                          |
| Crash               | kdump or ramoops                       |
| Unresponsive        | `hardlockup_panic`, `softlockup_panic` |

Crash conditions in the Linux kernel trigger panics, which are handled by kdump
(though they may be handled by ramoops until kdump support is integrated).
Kernel lockup conditions can be configured to trigger panics, which in turn
trigger either ramoops or kdump.
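For illustration, the lockup detectors are directed to panic (and therefore
into ramoops or kdump) through standard sysctls; a minimal sketch, with the
persistent form shown in comments:

```sh
# Illustrative: configure the kernel's lockup detectors to panic so that
# ramoops or kdump can capture debug data.
sysctl -w kernel.softlockup_panic=1
sysctl -w kernel.hardlockup_panic=1
# Persistent equivalent, e.g. in /etc/sysctl.d/80-lockup-panic.conf:
#   kernel.softlockup_panic = 1
#   kernel.hardlockup_panic = 1
```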
### Synthesis

In the context of the information above, handling of application lock-up error
conditions is not provided. For applications in the platform-data-provider
class of external interfaces, the system will continue to operate with reduced
functionality. For applications in the platform-data-transport-provider class,
this represents a critical failure of the firmware that must have accompanying
debug data.

## Handling platform-data-transport-provider failures

### Requirements

#### Recovery Mechanisms

The ability for external consumers to control the recovery behaviour of BMC
services is usually coarse; the nuanced handling is left to the BMC
implementation. Where available, the options for external consumers tend to be,
in ascending order of severity:

| Severity | BMC Recovery Mechanism  | Used for                                                               |
| -------- | ----------------------- | ---------------------------------------------------------------------- |
| 1        | Graceful reboot request | Normal circumstances or recovery from platform data provider failures  |
| 2        | Forceful reboot request | Recovery from unresponsive platform data transport providers           |
| 3        | External hardware reset | Unresponsive operating system                                          |

Of course it's not possible to issue these requests over interfaces that are
unresponsive. A robust platform design should be capable of issuing all three
restart requests over separate interfaces to minimise the impact of any one
interface becoming unresponsive. Further, the more severe the reset type, the
fewer dependencies should be in its execution path.

Given the out-of-band path is often limited to just the network, it's not
feasible for the BMC to provide any of the above in the event of some kind of
network or relevant data transport failure. The considerations here are
therefore limited to recovery of unresponsive in-band interfaces.

The need to escalate above mechanism 1 should come with data that captures why
it was necessary, i.e. dumps for services that failed in the path for 1.
However, by escalating straight to 3, the BMC will necessarily miss out on
capturing a debug dump because there is no opportunity for software to
intervene in the reset. Therefore, mechanism 2 should exist in the system
design, and its implementation should capture any appropriate data needed to
debug both the need to reboot and the inability to execute on approach 1.

The need to escalate to 3 would indicate that the BMC's own mechanisms for
detecting a kernel lockup have failed. Had they not failed, we would have
ramoops or kdump data to analyse. As data cannot be captured with an escalation
to mechanism 3, the need to invoke it will require its own specialised debug
experience. Given this and the kernel's own lockup detection and data
collection mechanisms, support for 2 can be implemented in BMC userspace.

Mechanism 1 is typically initiated by the usual in-band interfaces, either IPMI
or PLDM. In order to avoid these in the implementation of mechanism 2, the host
needs an interface to the BMC that is dedicated to the role of BMC recovery,
with minimal dependencies on the BMC side for initiating the dump collection
and reboot. At its core, all that is needed is the ability to trigger a BMC
IRQ, which could be as simple as monitoring a GPIO.
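To make the modesty of that requirement concrete, a GPIO-based trigger need be
little more than the following sketch, assuming a libgpiod-style `gpiomon`
utility and placeholder chip, line and polarity values that are not part of any
platform design:

```sh
# Illustrative sketch only: block until a hypothetical dedicated recovery
# GPIO is asserted by the host, then hand over to the crash dump path.
gpiomon --num-events=1 --rising-edge gpiochip0 7 &&
        echo c > /proc/sysrq-trigger
```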
#### Behavioural Requirements for Recovery Mechanism 2

The system behaviour requirement for the mechanism is:

1. The BMC executes collection of debug data and then reboots once it observes
   a recovery message from the host

It's desirable that:

1. The host has some indication that the recovery process has been activated
2. The host has some indication that a BMC reset has taken place

It's necessary that:

1. The host make use of a timeout to escalate to recovery mechanism 3, as it's
   possible the BMC will be unresponsive to recovery mechanism 2

#### Analysis of BMC Recovery Mechanisms for Power10 Platforms

The implementation of recovery mechanism 1 is already accounted for in the
in-band protocols between the host and the BMC and so is considered resolved
for the purpose of this discussion.

To address recovery mechanism 3, the Power10 platform designs wire up a GPIO
driven by the host to the BMC's EXTRST pin. If the host firmware detects that
the BMC has become unresponsive to its escalating recovery requests, it can
drive the hardware to forcefully reset the BMC.

However, host-side GPIOs are in short supply, and we do not have a dedicated
pin to implement recovery mechanism 2 in the platform designs.

#### Analysis of Implementation Methods on Power10 Platforms

The implementation of recovery mechanism 2 is limited to using existing
interfaces between the host and the BMC. These largely consist of:

1. FSI
2. LPC
3. PCIe

FSI is inappropriate because the host is the peripheral in its relationship
with the BMC. If the BMC has become unresponsive, it is possible it's in a
state where it would not accept FSI traffic (which it needs to drive in the
first place), and we would need a mechanism architected into FSI for the BMC to
recognise it is in a bad state. PCIe and LPC are preferable by comparison as
the BMC is the peripheral in this relationship, with the host driving cycles
into it over either interface. Comparatively, PCIe is more complex than LPC, so
an LPC-based approach is preferred.

The host already makes use of several LPC peripherals exposed from the BMC:

1. Mapped LPC FW cycles
2. iBT for IPMI
3. The VUARTs for system and debug consoles
4. A KCS device for a vendor-defined MCTP LPC binding

The host could take advantage of any of the following LPC peripherals for
implementing recovery mechanism 2:

1. The SuperIO-based iLPC2AHB bridge
2. The LPC mailbox
3. An otherwise unused KCS device

In ASPEED BMC SoCs prior to the AST2600, the LPC mailbox required configuration
via the SuperIO device, which exposes the unrestricted iLPC2AHB backdoor into
the BMC's physical address space. The iLPC2AHB capability could not be
mitigated without disabling SuperIO support entirely, and so the ability to use
the mailbox went with it. This security issue is resolved in the AST2600
design, so the mailbox could be used in the Power10 platforms, but we have
lower-complexity alternatives for generating an IRQ on the BMC. We could use
the iLPC2AHB from the host to drive one of the watchdogs in the BMC to trigger
a reset, but this exposes a stability risk due to the unrestricted power of the
interface, let alone the security implications, and like the mailbox it is more
complex than the alternatives.

This draws us towards the use of a KCS device, which is best aligned with the
simple need of generating an IRQ on the BMC. The AST2600 has at least four KCS
devices, of which one is already in use for IBM's vendor-defined MCTP LPC
binding, leaving at least three from which to choose.

### Proposed Design

The proposed design is for a simple daemon, started at BMC boot, to invoke the
desired crash dump handler according to the system policy upon receiving the
external signal. The implementation should have no IPC dependencies or
interactions with `init`, as the reason for invoking the recovery mechanism is
unknown and any of these interfaces might be unresponsive.

A trivial implementation of the daemon is

```sh
# Block until the host writes a byte, then trigger a crash dump via sysrq
dd if="$path" bs=1 count=1
echo c > /proc/sysrq-trigger
```

For systems with kdump enabled, this will result in a kernel crash dump being
collected and the BMC being rebooted.

A more elegant implementation might be to invoke `kexec` directly, but this
requires that the support is already available on the platform.
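For orientation, the kdump flow relied on above amounts to having a capture
kernel pre-loaded. A hand-driven sketch, with placeholder image paths and
kernel command line that are not defined by this design:

```sh
# Illustrative only: pre-load a capture kernel so that a later panic (for
# example from `echo c > /proc/sysrq-trigger`) boots into it and collects a
# dump. Paths and arguments are placeholders.
kexec -p /boot/capture-kernel \
      --initrd=/boot/capture-initrd.img \
      --append="console=ttyS4,115200 maxcpus=1 reset_devices"
```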
Other activities in userspace might be feasible if it can be assumed that
whatever failure has occurred will not prevent debug data collection, but no
statement about this can be made in general.

#### An Idealised KCS-based Protocol for Power10 Platforms

The proposed implementation provides for both the required and desired
behaviours outlined in the requirements section above.

The host and BMC protocol operates as follows, starting with the BMC
application invoked during the boot process:

1. Set the `Ready` bit in STR

2. Wait for an `IBF` interrupt

3. Read `IDR`. The hardware clears IBF as a result

4. If the read value is 0x44 (`D` for "Debug") then execute the debug dump
   collection process and reboot. Otherwise,

5. Go to step 2.

On the host:

1. If the `Ready` bit in STR is clear, escalate to recovery mechanism 3.
   Otherwise,

2. If the `IBF` bit in STR is set, escalate to recovery mechanism 3. Otherwise,

3. Start an escalation timer

4. Write 0x44 (`D` for "Debug") to the Input Data Register (IDR). The hardware
   sets IBF as a result

5. If `IBF` clears before expiry, restart the escalation timer

6. If an STR read generates an LPC SYNC No Response abort, or `Ready` clears
   before expiry, restart the escalation timer

7. If `Ready` becomes set before expiry, disarm the escalation timer. Recovery
   is complete. Otherwise,

8. Escalate to recovery mechanism 3 if the escalation timer expires at any
   point

A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
implementation is not required to emit one and the host implementation must
behave correctly without one. Recovery is only necessary if other paths have
failed, so STR can be read by the host when it decides recovery is required,
and by time-based polling thereafter.

The host must be prepared to handle LPC SYNC errors when accessing the KCS
device IO addresses, particularly "No Response" aborts. It is not guaranteed
that the KCS device will remain available during BMC resets.

As STR is polled by the host, it's not necessary for the BMC to write to ODR.
The protocol only requires the host to write to IDR and periodically poll STR
for changes to the IBF and Ready state. This removes bi-directional
dependencies.

The uni-directional writes and the lack of SerIRQ reduce the features required
for correct operation of the protocol, and thus the surface area for failure of
the recovery protocol.

The layout of the KCS Status Register (STR) is as follows:

| Bit | Owner    | Definition               |
| --- | -------- | ------------------------ |
| 7   | Software |                          |
| 6   | Software |                          |
| 5   | Software |                          |
| 4   | Software | Ready                    |
| 3   | Hardware | Command / Data           |
| 2   | Software |                          |
| 1   | Hardware | Input Buffer Full (IBF)  |
| 0   | Hardware | Output Buffer Full (OBF) |

#### A Real-World Implementation of the KCS Protocol for Power10 Platforms

Implementing the protocol described above in userspace is challenging due to
the available kernel interfaces[1], and implementing the behaviour in the
kernel falls afoul of the de facto "mechanism, not policy" rule of kernel
support.

Realistically, on the host side the only requirements are the use of a timer
and writing the appropriate value to the Input Data Register (IDR). All the
proposed status bits can be ignored. With this in mind, the BMC's
implementation can be reduced to reading an appropriate value from IDR.
Reducing requirements on the BMC's behaviour in this way allows the use of the
`serio_raw` driver (which has the restriction that userspace can't access the
status value).

[1]
https://lore.kernel.org/lkml/37e75b07-a5c6-422f-84b3-54f2bea0b917@www.fastmail.com/
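A minimal sketch of that reduced BMC-side behaviour in shell, assuming the
chosen KCS channel is bound to `serio_raw` and appears as `/dev/serio_raw0`
(the device name is illustrative; the prototype in the next section is the
concrete implementation):

```sh
# Illustrative sketch: block until the host writes a byte to the KCS IDR and
# trigger the crash dump path only for the protocol's 'D' (0x44) value.
byte="$(dd if=/dev/serio_raw0 bs=1 count=1 2>/dev/null)"
if [ "$byte" = "D" ]; then
    echo c > /proc/sysrq-trigger
fi
```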
#### Prototype Implementation Supporting Power10 Platforms

A concrete implementation of the proposal's userspace daemon is available on
GitHub:

https://github.com/amboar/debug-trigger/

Deployment requires additional kernel support in the form of the patches at
[2].

[2]
https://github.com/amboar/linux/compare/2dbb5aeba6e55e2a97e150f8371ffc1cc4d18180...for/openbmc/kcs-raw

### Alternatives Considered

See the discussion in Background.

### Impacts

The proposal has some security implications. The mechanism provides an
unauthenticated means for the host firmware to crash and/or reboot the BMC,
which can itself become a concern for stability and availability. Use of this
feature requires that the host firmware is trusted, that is, that the host and
BMC firmware are in the same trust domain. If a platform concept requires that
the BMC and host firmware remain in disjoint trust domains then this feature
must not be provided by the BMC.

As the feature might provide surprising system behaviour, there is an impact on
documentation for systems deploying this design: the mechanism must be
documented in such a way that rebooting the BMC in these circumstances isn't
surprising.

Developers are impacted in the sense that they may have access to better debug
data than might otherwise be possible. There are no obvious developer-specific
drawbacks.

Due to simplicity being a design point of the proposal, there are no
significant API, performance or upgradability impacts.

### Testing

Generally, testing this feature requires complex interactions with host
firmware and platform-specific mechanisms for triggering the reboot behaviour.

For Power10 platforms this feature may be safely tested under QEMU by scripting
the monitor to inject values on the appropriate KCS device. Implementing this
for automated testing may need explicit support in CI.
## Handling platform-data-provider failures

### Requirements

As noted above, these types of failures usually yield a system that can
continue to operate in a reduced capacity. The desired behavior in this
scenario can vary from system to system, so the requirements in this area need
to be flexible enough to allow system owners to configure their desired
behavior.

The requirements for OpenBMC when a platform-data-provider service enters a
failed state are that the BMC:

- Logs an error indicating a service has failed
- Collects a BMC dump
- Changes BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
- Allows system owners to customize other behaviors (e.g. a BMC reboot)

### Proposed Design

This will build upon the existing [target-fail-monitoring][1] design. The
monitor service will be enhanced to also take JSON file(s) which list critical
services to monitor.

Define an "obmc-bmc-service-quiesce.target". System owners can install any
other services they wish in this new target.

phosphor-bmc-state-manager will monitor this target and enter a `Quiesced`
state when it is started. This state will be reported externally via the
Redfish API in the status property under redfish/v1/Managers/bmc.

This would look like the following:

- In a services-to-monitor configuration file, add all critical services
- The state-manager service-monitor will subscribe to signals for service
  failures and do the following when one fails from within the configuration
  file:
  - Log an error with the service failure information
  - Request a BMC dump
  - Start obmc-bmc-service-quiesce.target
- BMC state manager detects obmc-bmc-service-quiesce.target has started and
  puts the BMC state into Quiesced
- bmcweb looks at the BMC state to return the appropriate state to external
  clients

[1]:
  https://github.com/openbmc/docs/blob/master/designs/target-fail-monitoring.md

### Alternatives Considered

One simpler option would be to just have the OnFailure result in a BMC reboot,
but historically this has caused more problems than it solves:

- Rarely does a BMC reboot fix a service that was not fixed by simply
  restarting it.
- A BMC that continuously reboots itself due to a service failure is very
  difficult to debug.
- Some BMCs only allow a certain number of reboots, so eventually the BMC ends
  up stuck in the boot loader, which is inaccessible unless special debug
  cables are available, so for all intents and purposes the system is now
  unusable.

### Impacts

Currently nothing happens when a service enters the failed state. The changes
proposed in this document will ensure an error is logged, a dump is collected,
and the external BMC state reflects the failure when this occurs.

### Testing

A variety of services should be put into the failed state, and the tester
should ensure the appropriate error is logged, a dump is collected, and the BMC
state is changed to reflect this.
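A possible manual test sketch, assuming a placeholder service name and the
`obmcutil` helper; the number of induced crashes needed to reach the failed
state depends on the restart policy discussed earlier:

```sh
# Illustrative: crash a monitored service (name is a placeholder) until
# systemd marks it failed, then check for the documented reactions.
for i in 1 2 3; do
    systemctl kill --signal=SIGSEGV example-monitored.service
    sleep 1
done
systemctl is-failed example-monitored.service
obmcutil state    # CurrentBMCState should report the quiesced state
# Also confirm that an error was logged and a BMC dump was generated.
```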