# BMC Service Failure Debug and Recovery

Author:
Andrew Jeffery <andrew@aj.id.au> @arj

Other contributors:
Andrew Geissler <geissonator@yahoo.com> @geissonator

Created:
6th May 2021

## Problem Description

The capability to debug critical failures of the BMC firmware is essential to
meet the reliability and serviceability claims made for some platforms.

This design addresses a few classes of failures:

- A class of failure exists where a BMC service has entered a failed state but
  the BMC is still operational in a degraded mode.
- A class of failure exists under which we can attempt debug data collection
  despite being unable to communicate with the BMC via standard protocols.

This proposal argues for and proposes a software-driven debug data capture and
recovery of a failed BMC.

## Background and References

By necessity, BMCs are not self-contained systems. BMCs exist to service the
needs of both the host system, by providing in-band platform services such as
thermal and power management, and system operators, by providing out-of-band
system management interfaces such as error reporting, platform telemetry and
firmware management.

As such, failures of BMC subsystems may impact external consumers.

The BMC firmware stack is not trivial, in the sense that common implementations
are usually domain-specific Linux distributions with complex or highly coupled
relationships to platform subsystems.

Complexity and coupling drive concern around the risk of critical failures in
the BMC firmware. The BMC firmware design should provide for resilience and
recovery in the face of well-defined error conditions, but the need to mitigate
ill-defined error conditions or entering unintended software states remains.

The ability for a system to recover in the face of an error condition depends
on its ability to detect the failure. Thus, error conditions can be assigned to
various classes based on the ability to externally observe the error:

1. Continued operation: The service detects the error and performs the actions
   required to return to its operating state

2. Graceful exit: The service detects an error it cannot recover from, but
   gracefully cleans up its resources before exiting with an appropriate exit
   status

3. Crash: The service detects it is in an unintended software state and exits
   immediately, failing to gracefully clean up its resources before exiting

4. Unresponsive: The service fails to detect it cannot make progress and
   continues to run but is unresponsive

As the state transformations to enter the ill-defined or unintended software
state are unanticipated, the actions required to gracefully return to an
expected state are also not well defined. The general approaches to recover a
system or service to a known state in the face of entering an unknown state
are:

1. Restart the affected service
2. Restart the affected set of services
3. Restart all services

In the face of continued operation due to internal recovery a service restart
is unnecessary, while in the case of an unresponsive service the need to
restart cannot be detected by service state alone. Implementation of resiliency
by way of service restarts via a service manager is only possible in the face
of a graceful exit or application crash. Handling of services that have entered
an unresponsive state can only begin upon receiving external input.
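
For illustration, on a systemd-based BMC the three general recovery approaches
above map roughly onto the following operations. This is a sketch only; the
unit names are hypothetical and the real units are platform-specific:

```sh
# Illustrative only: hypothetical unit names, assuming a systemd-based BMC
systemctl restart xyz.openbmc_project.Example.service  # 1. the affected service
systemctl restart example-subsystem.target             # 2. the affected set of services
reboot                                                  # 3. all services, via a BMC reboot
```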

Like error conditions, services exposed by the BMC can be divided into several
external interface classes:

1. Providers of platform data
2. Providers of platform data transports

Examples of the first are applications that expose various platform sensors or
provide data about the firmware itself. Failure of the first class of
applications usually yields a system that can continue to operate in a reduced
capacity.

Examples of the second are the operating system itself and applications that
implement IPMI, HTTPS (e.g. for Redfish), MCTP and PLDM. This second class also
covers implementation-specific data transports such as D-Bus, which requires a
broker service. Failure of a platform data transport may result in one or all
external interfaces becoming unresponsive, and may be viewed as a critical
failure of the BMC.

Like error conditions and services, the BMC's external interfaces can be
divided into several classes:

1. Out-of-band interfaces: Remote, operator-driven platform management
2. In-band interfaces: Local, host-firmware-driven platform management

Failures of platform data transports generally leave out-of-band interfaces
unresponsive to the point that the BMC cannot be recovered except via external
means, usually by issuing a (disruptive) AC power cycle. On the other hand, if
the host can detect the BMC is unresponsive on the in-band interface(s), an
appropriate platform design can enable the host to reset the BMC without
disrupting its own operation.

### Analysis of eBMC Error State Management and Mitigation Mechanisms

Assessing OpenBMC userspace with respect to the error classes outlined above,
the system manages and mitigates error conditions as follows:

| Condition           | Mechanism                                           |
|---------------------|-----------------------------------------------------|
| Continued operation | Application-specific error handling                 |
| Graceful exit       | Application-specific error handling                 |
| Crash               | Signal, unhandled exceptions, `assert()`, `abort()` |
| Unresponsive        | None                                                |

These mechanisms inform systemd (the service manager) of an event, which it
handles according to the restart policy encoded in the unit file for the
service.

OpenBMC has a default restart behavior for all systemd services: a service is
allowed to restart twice every 30 seconds. If a service restarts more than
twice within 30 seconds then systemd considers that service to be in a failed
state and does not restart it again until the BMC is rebooted.

Assessing the OpenBMC operating system with respect to the error classes, it
manages and mitigates error conditions as follows:

| Condition           | Mechanism                              |
|---------------------|----------------------------------------|
| Continued operation | ramoops, ftrace, `printk()`            |
| Graceful exit       | System reboot                          |
| Crash               | kdump or ramoops                       |
| Unresponsive        | `hardlockup_panic`, `softlockup_panic` |

Crash conditions in the Linux kernel trigger panics, which are handled by kdump
(though they may be handled by ramoops until kdump support is integrated).
Kernel lockup conditions can be configured to trigger panics, which in turn
trigger either ramoops or kdump.

### Synthesis

In the context of the information above, handling of application lock-up error
conditions is not provided.
For applications in the platform-data-provider class of external interfaces,
the system will continue to operate with reduced functionality. For
applications in the platform-data-transport-provider class, this represents a
critical failure of the firmware that must have accompanying debug data.

## Handling platform-data-transport-provider failures

### Requirements

#### Recovery Mechanisms

The ability for external consumers to control the recovery behaviour of BMC
services is usually coarse; the nuanced handling is left to the BMC
implementation. Where available, the options for external consumers tend to be,
in ascending order of severity:

| Severity | BMC Recovery Mechanism  | Used for                                                               |
|----------|-------------------------|------------------------------------------------------------------------|
| 1        | Graceful reboot request | Normal circumstances or recovery from platform data provider failures  |
| 2        | Forceful reboot request | Recovery from unresponsive platform data transport providers           |
| 3        | External hardware reset | Unresponsive operating system                                           |

Of course it's not possible to issue these requests over interfaces that are
unresponsive. A robust platform design should be capable of issuing all three
restart requests over separate interfaces to minimise the impact of any one
interface becoming unresponsive. Further, the more severe the reset type, the
fewer dependencies should be in its execution path.

Given the out-of-band path is often limited to just the network, it's not
feasible for the BMC to provide any of the above in the event of some kind of
network or relevant data transport failure. The considerations here are
therefore limited to recovery of unresponsive in-band interfaces.

The need to escalate above mechanism 1 should come with data that captures why
it was necessary, i.e. dumps for services that failed in the path for 1.
However, by escalating straight to 3, the BMC will necessarily miss out on
capturing a debug dump because there is no opportunity for software to
intervene in the reset. Therefore, mechanism 2 should exist in the system
design and its implementation should capture any appropriate data needed to
debug the need to reboot and the inability to execute on approach 1.

The need to escalate to 3 would indicate that the BMC's own mechanisms for
detecting a kernel lockup have failed. Had they not failed, we would have
ramoops or kdump data to analyse. As data cannot be captured with an escalation
to mechanism 3, the need to invoke it will require its own specialised debug
experience. Given this, and the kernel's own lockup detection and data
collection mechanisms, support for 2 can be implemented in BMC userspace.

Mechanism 1 is typically initiated by the usual in-band interfaces, either IPMI
or PLDM. In order to avoid these in the implementation of mechanism 2, the host
needs an interface to the BMC that is dedicated to the role of BMC recovery,
with minimal dependencies on the BMC side for initiating the dump collection
and reboot. At its core, all that is needed is the ability to trigger a BMC
IRQ, which could be as simple as monitoring a GPIO.

#### Behavioural Requirements for Recovery Mechanism 2

The system behaviour requirement for the mechanism is:

1. The BMC executes collection of debug data and then reboots once it observes
   a recovery message from the host

It's desirable that:

1. The host has some indication that the recovery process has been activated
2. The host has some indication that a BMC reset has taken place

It's necessary that:

1. The host make use of a timeout to escalate to recovery mechanism 3, as it's
   possible the BMC will be unresponsive to recovery mechanism 2

#### Analysis of BMC Recovery Mechanisms for Power10 Platforms

The implementation of recovery mechanism 1 is already accounted for in the
in-band protocols between the host and the BMC, and so is considered resolved
for the purpose of this discussion.

To address recovery mechanism 3, the Power10 platform designs wire up a GPIO
driven by the host to the BMC's EXTRST pin. If the host firmware detects that
the BMC has become unresponsive to its escalating recovery requests, it can
drive the hardware to forcefully reset the BMC.

However, host-side GPIOs are in short supply, and we do not have a dedicated
pin to implement recovery mechanism 2 in the platform designs.

#### Analysis of Implementation Methods on Power10 Platforms

The implementation of recovery mechanism 2 is limited to using existing
interfaces between the host and the BMC. These largely consist of:

1. FSI
2. LPC
3. PCIe

FSI is inappropriate because the host is the peripheral in its relationship
with the BMC. If the BMC has become unresponsive, it is possible it's in a
state where it would not accept FSI traffic (which it needs to drive in the
first place), and we would need a mechanism architected into FSI for the BMC to
recognise it is in a bad state. PCIe and LPC are preferable by comparison, as
the BMC is the peripheral in this relationship, with the host driving cycles
into it over either interface. Comparatively, PCIe is more complex than LPC, so
an LPC-based approach is preferred.

The host already makes use of several LPC peripherals exposed from the BMC:

1. Mapped LPC FW cycles
2. iBT for IPMI
3. The VUARTs for system and debug consoles
4. A KCS device for a vendor-defined MCTP LPC binding

The host could take advantage of any of the following LPC peripherals for
implementing recovery mechanism 2:

1. The SuperIO-based iLPC2AHB bridge
2. The LPC mailbox
3. An otherwise unused KCS device

In ASPEED BMC SoCs prior to the AST2600, the LPC mailbox required configuration
via the SuperIO device, which exposes the unrestricted iLPC2AHB backdoor into
the BMC's physical address space. The iLPC2AHB capability could not be
mitigated without disabling SuperIO support entirely, and so the ability to use
the mailbox went with it. This security issue is resolved in the AST2600
design, so the mailbox could be used in the Power10 platforms, but we have
lower-complexity alternatives for generating an IRQ on the BMC. We could use
the iLPC2AHB from the host to drive one of the watchdogs in the BMC to trigger
a reset, but this exposes a stability risk due to the unrestricted power of the
interface, let alone the security implications, and like the mailbox it is more
complex than the alternatives.

This draws us towards the use of a KCS device, which is best aligned with the
simple need of generating an IRQ on the BMC.
The AST2600 has at least 4 KCS devices, of which one is already in use for
IBM's vendor-defined MCTP LPC binding, leaving at least 3 from which to choose.

### Proposed Design

The proposed design is for a simple daemon, started at BMC boot, to invoke the
desired crash dump handler according to the system policy upon receiving the
external signal. The implementation should have no IPC dependencies or
interactions with `init`, as the reason for invoking the recovery mechanism is
unknown and any of these interfaces might be unresponsive.

A trivial implementation of the daemon is

```sh
# Block until the host writes a byte to the KCS data register at $path
dd if=$path bs=1 count=1
# Trigger a kernel crash to collect debug data and reboot the BMC
echo c > /proc/sysrq-trigger
```

For systems with kdump enabled, this will result in a kernel crash dump being
collected and the BMC being rebooted.

A more elegant implementation might be to invoke `kexec` directly, but this
requires that the support is already available on the platform.

Other activities in userspace might be feasible if it can be assumed that
whatever failure has occurred will not prevent debug data collection, but no
statement about this can be made in general.

#### An Idealised KCS-based Protocol for Power10 Platforms

The proposed implementation provides for both the required and desired
behaviours outlined in the requirements section above.

The host and BMC protocol operates as follows, starting with the BMC
application invoked during the boot process:

1. Set the `Ready` bit in STR

2. Wait for an `IBF` interrupt

3. Read `IDR`. The hardware clears IBF as a result

4. If the read value is 0x44 (`D` for "Debug") then execute the debug dump
   collection process and reboot. Otherwise,

5. Go to step 2.

On the host:

1. If the `Ready` bit in STR is clear, escalate to recovery mechanism 3.
   Otherwise,

2. If the `IBF` bit in STR is set, escalate to recovery mechanism 3. Otherwise,

3. Start an escalation timer

4. Write 0x44 (`D` for "Debug") to the Input Data Register (IDR). The hardware
   sets IBF as a result

5. If `IBF` clears before expiry, restart the escalation timer

6. If an STR read generates an LPC SYNC No Response abort, or `Ready` clears
   before expiry, restart the escalation timer

7. If `Ready` becomes set before expiry, disarm the escalation timer. Recovery
   is complete. Otherwise,

8. Escalate to recovery mechanism 3 if the escalation timer expires at any
   point

A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
implementation is not required to emit one and the host implementation must
behave correctly without one. Recovery is only necessary if other paths have
failed, so STR can be read by the host when it decides recovery is required,
and by time-based polling thereafter.

The host must be prepared to handle LPC SYNC errors when accessing the KCS
device IO addresses, particularly "No Response" aborts. It is not guaranteed
that the KCS device will remain available during BMC resets.

As STR is polled by the host, it's not necessary for the BMC to write to ODR.
The protocol only requires the host to write to IDR and periodically poll STR
for changes to the IBF and Ready state. This removes bi-directional
dependencies.
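
As a rough sketch of the BMC-side behaviour, the daemon loop might look like
the following. It assumes the KCS data register is exposed to userspace as a
character device (the path `/dev/serio_raw0` is illustrative) and, as discussed
in the real-world implementation section below, ignores the status register
entirely:

```sh
#!/bin/sh
# Sketch only: assumes kdump or ramoops is configured to preserve the dump.
path=/dev/serio_raw0

while true; do
    # Block until the host writes a byte to IDR, then read it as hex
    byte=$(dd if="$path" bs=1 count=1 2>/dev/null | od -An -tx1 | tr -d ' ')
    if [ "$byte" = "44" ]; then
        # 0x44 ('D' for "Debug"): collect debug data and reboot via a
        # kernel crash
        echo c > /proc/sysrq-trigger
    fi
done
```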

The uni-directional writes and the lack of SerIRQ reduce the features required
for correct operation of the protocol, and thus the surface area for failure of
the recovery protocol.

The layout of the KCS Status Register (STR) is as follows:

| Bit | Owner    | Definition               |
|-----|----------|--------------------------|
| 7   | Software |                          |
| 6   | Software |                          |
| 5   | Software |                          |
| 4   | Software | Ready                    |
| 3   | Hardware | Command / Data           |
| 2   | Software |                          |
| 1   | Hardware | Input Buffer Full (IBF)  |
| 0   | Hardware | Output Buffer Full (OBF) |

#### A Real-World Implementation of the KCS Protocol for Power10 Platforms

Implementing the protocol described above in userspace is challenging due to
the available kernel interfaces [1], and implementing the behaviour in the
kernel falls afoul of the de facto "mechanism, not policy" rule of kernel
support.

Realistically, on the host side the only requirements are the use of a timer
and writing the appropriate value to the Input Data Register (IDR). All the
proposed status bits can be ignored. With this in mind, the BMC's
implementation can be reduced to reading an appropriate value from IDR.
Reducing requirements on the BMC's behaviour in this way allows the use of the
`serio_raw` driver (which has the restriction that userspace can't access the
status value).

[1] https://lore.kernel.org/lkml/37e75b07-a5c6-422f-84b3-54f2bea0b917@www.fastmail.com/

#### Prototype Implementation Supporting Power10 Platforms

A concrete implementation of the proposal's userspace daemon is available on
GitHub:

https://github.com/amboar/debug-trigger/

Deployment requires additional kernel support in the form of patches at [2].

[2] https://github.com/amboar/linux/compare/2dbb5aeba6e55e2a97e150f8371ffc1cc4d18180...for/openbmc/kcs-raw

### Alternatives Considered

See the discussion in Background.

### Impacts

The proposal has some security implications. The mechanism provides an
unauthenticated means for the host firmware to crash and/or reboot the BMC,
which can itself become a concern for stability and availability. Use of this
feature requires that the host firmware is trusted; that is, the host and BMC
firmware must be in the same trust domain. If a platform concept requires that
the BMC and host firmware remain in disjoint trust domains then this feature
must not be provided by the BMC.

As the feature might provide surprising system behaviour, there is an impact on
documentation for systems deploying this design: the mechanism must be
documented in such a way that rebooting the BMC in these circumstances isn't
surprising.

Developers are impacted in the sense that they may have access to better debug
data than might otherwise be possible. There are no obvious developer-specific
drawbacks.

Due to simplicity being a design point of the proposal, there are no
significant API, performance or upgradability impacts.

### Testing

Generally, testing this feature requires complex interactions with host
firmware and platform-specific mechanisms for triggering the reboot behaviour.

For Power10 platforms this feature may be safely tested under QEMU by scripting
the monitor to inject values on the appropriate KCS device. Implementing this
for automated testing may need explicit support in CI.
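
Independent of the KCS trigger path, the dump-and-reboot behaviour itself can
be exercised manually on a development BMC. A minimal smoke test, assuming
ramoops or kdump is configured, is:

```sh
# Force a kernel crash; ramoops or kdump should preserve the debug data
echo c > /proc/sysrq-trigger

# After the BMC reboots, check that a dump was preserved (ramoops example)
ls /sys/fs/pstore/
```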

## Handling platform-data-provider failures

### Requirements

As noted above, these types of failures usually yield a system that can
continue to operate in a reduced capacity. The desired behavior in this
scenario can vary from system to system, so the requirements in this area need
to be flexible enough to allow system owners to configure their desired
behavior.

The requirements for OpenBMC when a platform-data-provider service enters a
failed state are that the BMC:

- Logs an error indicating a service has failed
- Collects a BMC dump
- Changes BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
- Allows system owners to customize other behaviors (e.g. a BMC reboot)

### Proposed Design

This will build upon the existing [target-fail-monitoring][1] design. The
monitor service will be enhanced to also take JSON file(s) which list critical
services to monitor.

Define an "obmc-bmc-service-quiesce.target". System owners can install any
other services they wish in this new target.

phosphor-bmc-state-manager will monitor this target and enter a `Quiesced`
state when it is started. This state will be reported externally via the
Redfish API under the redfish/v1/Managers/bmc status property.

This would look like the following:

- In a services-to-monitor configuration file, add all critical services
- The state-manager service-monitor will subscribe to signals for service
  failures and do the following when a service from within the configuration
  file fails:
  - Log an error with the service failure information
  - Request a BMC dump
  - Start obmc-bmc-service-quiesce.target
- The BMC state manager detects that obmc-bmc-service-quiesce.target has
  started and puts the BMC state into Quiesced
- bmcweb looks at the BMC state to return the appropriate state to external
  clients

[1]: https://github.com/openbmc/docs/blob/master/designs/target-fail-monitoring.md

### Alternatives Considered

One simpler option would be to just have the OnFailure result in a BMC reboot,
but historically this has caused more problems than it solves:

- Rarely does a BMC reboot fix a service that was not fixed by simply
  restarting it.
- A BMC that continuously reboots itself due to a service failure is very
  difficult to debug.
- Some BMCs only allow a certain number of reboots, so eventually the BMC ends
  up stuck in the boot loader, which is inaccessible unless special debug
  cables are available; for all intents and purposes the system is then
  unusable.

### Impacts

Currently nothing happens when a service enters the failed state. The changes
proposed in this document will ensure an error is logged, a dump is collected,
and the external BMC state reflects the failure when this occurs.

### Testing

A variety of services should be put into the failed state and the tester should
ensure the appropriate error is logged, a dump is collected, and the BMC state
is changed to reflect this.
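
A sketch of such a test, assuming a hypothetical monitored service and the
standard phosphor-state-manager D-Bus object layout, might look like:

```sh
# Force a monitored service into the failed state (unit name is illustrative)
systemctl kill --signal=SIGSEGV xyz.openbmc_project.Example.service
systemctl is-failed xyz.openbmc_project.Example.service

# Confirm the reported BMC state has changed to Quiesced (object path assumed
# to follow the standard phosphor-state-manager layout)
busctl get-property xyz.openbmc_project.State.BMC \
    /xyz/openbmc_project/state/bmc0 \
    xyz.openbmc_project.State.BMC CurrentBMCState
```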