# BMC Service Failure Debug and Recovery

Author: Andrew Jeffery <andrew@aj.id.au> @arj

Primary Assignee: Andrew Jeffery <andrew@aj.id.au> @arj

Created: 6th May 2021

## Problem Description

The capability to debug critical failures of the BMC firmware is essential to
meet the reliability and serviceability claims made for some platforms.

A class of failure exists under which we can attempt debug data collection
despite being unable to communicate with the BMC via standard protocols.

This proposal argues for and describes a software-driven approach to capturing
debug data from, and recovering, a failed BMC.

## Background and References

By necessity, BMCs are not self-contained systems. BMCs exist to service the
needs of both the host system, by providing in-band platform services such as
thermal and power management, and of system operators, by providing
out-of-band system management interfaces such as error reporting, platform
telemetry and firmware management.

As such, failures of BMC subsystems may impact external consumers.

The BMC firmware stack is not trivial, in the sense that common
implementations are usually domain-specific Linux distributions with complex
or highly coupled relationships to platform subsystems.

Complexity and coupling drive concern around the risk of critical failures in
the BMC firmware. The BMC firmware design should provide for resilience and
recovery in the face of well-defined error conditions, but the need to
mitigate ill-defined error conditions and unintended software states remains.

The ability for a system to recover in the face of an error condition depends
on its ability to detect the failure. Thus, error conditions can be assigned
to various classes based on the ability to externally observe the error:

1. Continued operation: The service detects the error and performs the
   actions required to return to its operating state

2. Graceful exit: The service detects an error it cannot recover from, but
   gracefully cleans up its resources before exiting with an appropriate exit
   status

3. Crash: The service detects it is in an unintended software state and exits
   immediately, failing to gracefully clean up its resources

4. Unresponsive: The service fails to detect it cannot make progress and
   continues to run, but is unresponsive

As the state transformations leading to the ill-defined or unintended
software state are unanticipated, the actions required to gracefully return
to an expected state are also not well defined. The general approaches for
recovering a system or service to a known state after it has entered an
unknown state are:

1. Restart the affected service
2. Restart the affected set of services
3. Restart all services

In the case of continued operation due to internal recovery a service restart
is unnecessary, while in the case of an unresponsive service the need to
restart cannot be detected from service state alone. Implementation of
resiliency by way of service restarts via a service manager is only possible
in the face of a graceful exit or application crash. Handling of services
that have entered an unresponsive state can only begin upon receiving
external input.
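To make the limitation concrete, consider a minimal sketch of a service
manager restart policy. The unit name, paths and values here are illustrative
only, not taken from any OpenBMC recipe:

```sh
# Hypothetical sketch: install a unit whose restart policy covers crashes and
# unclean exits. A service that hangs while still "running" never trips
# Restart=on-failure, because the service manager observes no state change.
cat > /etc/systemd/system/example-platform-daemon.service <<'EOF'
[Unit]
Description=Example platform data provider

[Service]
ExecStart=/usr/bin/example-platform-daemon
Restart=on-failure
EOF
```

This is why the "Unresponsive" row in the analysis below lists no mechanism:
from the service manager's perspective, nothing has failed.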
Like error conditions, services exposed by the BMC can be divided into
several external interface classes:

1. Providers of platform data
2. Providers of platform data transports

Examples of the first are applications that expose various platform sensors
or provide data about the firmware itself. Failure of a platform data
provider usually yields a system that can continue to operate in a reduced
capacity.

Examples of the second are the operating system itself and applications that
implement IPMI, HTTPS (e.g. for Redfish), MCTP and PLDM. This second class
also covers implementation-specific data transports such as D-Bus, which
requires a broker service. Failure of a platform data transport may result in
one or all external interfaces becoming unresponsive, and may be viewed as a
critical failure of the BMC.

Like error conditions and services, the BMC's external interfaces can be
divided into several classes:

1. Out-of-band interfaces: Remote, operator-driven platform management
2. In-band interfaces: Local, host-firmware-driven platform management

Failures of platform data transports generally leave out-of-band interfaces
unresponsive to the point that the BMC cannot be recovered except via
external means, usually by issuing a (disruptive) AC power cycle. On the
other hand, if the host can detect that the BMC is unresponsive on the
in-band interface(s), an appropriate platform design can enable the host to
reset the BMC without disrupting its own operation.

### Analysis of eBMC Error State Management and Mitigation Mechanisms

Assessing OpenBMC userspace with respect to the error classes outlined above,
the system manages and mitigates error conditions as follows:

| Condition           | Mechanism                                            |
|---------------------|------------------------------------------------------|
| Continued operation | Application-specific error handling                  |
| Graceful exit       | Application-specific error handling                  |
| Crash               | Signals, unhandled exceptions, `assert()`, `abort()` |
| Unresponsive        | None                                                 |

These mechanisms inform systemd (the service manager) of an event, which it
handles according to the restart policy encoded in the unit file for the
service.

Assessing the OpenBMC operating system with respect to the error classes, it
manages and mitigates error conditions as follows:

| Condition           | Mechanism                              |
|---------------------|----------------------------------------|
| Continued operation | ramoops, ftrace, `printk()`            |
| Graceful exit       | System reboot                          |
| Crash               | kdump or ramoops                       |
| Unresponsive        | `hardlockup_panic`, `softlockup_panic` |

Crash conditions in the Linux kernel trigger panics, which are handled by
kdump (though they may be handled by ramoops until kdump support is
integrated). Kernel lockup conditions can be configured to trigger panics,
which in turn trigger either ramoops or kdump.
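For illustration, the lockup handling in the table above corresponds to
standard kernel tunables. A minimal sketch, assuming the kernel is built with
the soft- and hard-lockup detectors:

```sh
# Configure detected kernel lockups to panic, so the panic path (ramoops or
# kdump) captures debug data. Requires the corresponding lockup detectors to
# be enabled in the kernel configuration.
sysctl -w kernel.softlockup_panic=1
sysctl -w kernel.hardlockup_panic=1
```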
### Synthesis

In the context of the information above, handling of application lock-up
error conditions is not provided. For applications in the
platform-data-provider class of external interfaces, the system will continue
to operate with reduced functionality. For applications in the
platform-data-transport-provider class, such a failure represents a critical
failure of the firmware that must have accompanying debug data.

## Requirements

### Recovery Mechanisms

The ability for external consumers to control the recovery behaviour of BMC
services is usually coarse; the nuanced handling is left to the BMC
implementation. Where available, the options for external consumers tend to
be, in ascending order of severity:

| Severity | BMC Recovery Mechanism  | Used for                                                               |
|----------|-------------------------|------------------------------------------------------------------------|
| 1        | Graceful reboot request | Normal circumstances or recovery from platform data provider failures  |
| 2        | Forceful reboot request | Recovery from unresponsive platform data transport providers           |
| 3        | External hardware reset | Unresponsive operating system                                          |

Of course it's not possible to issue these requests over interfaces that are
unresponsive. A robust platform design should be capable of issuing all three
restart requests over separate interfaces to minimise the impact of any one
interface becoming unresponsive. Further, the more severe the reset type, the
fewer dependencies should be in its execution path.

Given the out-of-band path is often limited to just the network, it's not
feasible for the BMC to provide any of the above in the event of a network or
relevant data transport failure. The considerations here are therefore
limited to recovery of unresponsive in-band interfaces.

The need to escalate above mechanism 1 should come with data that captures
why it was necessary, i.e. dumps for the services that failed in the path for
1. However, by escalating straight to 3, the BMC will necessarily miss out on
capturing a debug dump because there is no opportunity for software to
intervene in the reset. Therefore, mechanism 2 should exist in the system
design, and its implementation should capture any appropriate data needed to
debug both the need to reboot and the inability to execute on approach 1.

The need to escalate to 3 would indicate that the BMC's own mechanisms for
detecting a kernel lockup have failed. Had they not failed, we would have
ramoops or kdump data to analyse. As data cannot be captured with an
escalation to mechanism 3, the need to invoke it will require its own
specialised debug experience. Given this, and the kernel's own lockup
detection and data collection mechanisms, support for 2 can be implemented in
BMC userspace.

Mechanism 1 is typically initiated by the usual in-band interfaces, either
IPMI or PLDM. In order to avoid these in the implementation of mechanism 2,
the host needs an interface to the BMC that is dedicated to the role of BMC
recovery, with minimal dependencies on the BMC side for initiating the dump
collection and reboot. At its core, all that is needed is the ability to
trigger a BMC IRQ, which could be as simple as monitoring a GPIO.
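For instance, if a platform did have a spare line available, the BMC-side
trigger could be as small as the following sketch using the libgpiod tools.
The chip name and line offset are hypothetical placeholders:

```sh
# Hypothetical sketch: block until a single rising edge on a dedicated
# recovery GPIO, then trigger debug data collection via a kernel crash.
# gpiochip0 and line 7 are platform-specific placeholders.
gpiomon --num-events=1 --rising-edge gpiochip0 7
echo c > /proc/sysrq-trigger
```

As discussed below, the Power10 platforms lack a spare pin for this purpose,
which motivates the search for an alternative IRQ source.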
### Behavioural Requirements for Recovery Mechanism 2

The system behaviour requirement for the mechanism is:

1. The BMC executes collection of debug data and then reboots once it
   observes a recovery message from the host

It's desirable that:

1. The host has some indication that the recovery process has been activated
2. The host has some indication that a BMC reset has taken place

It's necessary that:

1. The host make use of a timeout to escalate to recovery mechanism 3, as
   it's possible the BMC will be unresponsive to recovery mechanism 2

### Analysis of BMC Recovery Mechanisms for Power10 Platforms

The implementation of recovery mechanism 1 is already accounted for in the
in-band protocols between the host and the BMC, and so is considered resolved
for the purpose of this discussion.

To address recovery mechanism 3, the Power10 platform designs wire up a GPIO
driven by the host to the BMC's EXTRST pin. If the host firmware detects that
the BMC has become unresponsive to its escalating recovery requests, it can
drive the hardware to forcefully reset the BMC.

However, host-side GPIOs are in short supply, and we do not have a dedicated
pin to implement recovery mechanism 2 in the platform designs.

### Analysis of Implementation Methods on Power10 Platforms

The implementation of recovery mechanism 2 is limited to using existing
interfaces between the host and the BMC. These largely consist of:

1. FSI
2. LPC
3. PCIe

FSI is inappropriate because the host is the peripheral in its relationship
with the BMC. If the BMC has become unresponsive, it is possible it's in a
state where it would not accept FSI traffic (which it needs to drive in the
first place), and we would need a mechanism architected into FSI for the BMC
to recognise it is in a bad state. PCIe and LPC are preferable by comparison,
as the BMC is the peripheral in these relationships, with the host driving
cycles into it over either interface. Comparatively, PCIe is more complex
than LPC, so an LPC-based approach is preferred.

The host already makes use of several LPC peripherals exposed from the BMC:

1. Mapped LPC FW cycles
2. iBT for IPMI
3. The VUARTs for system and debug consoles
4. A KCS device for a vendor-defined MCTP LPC binding

The host could take advantage of any of the following LPC peripherals for
implementing recovery mechanism 2:

1. The SuperIO-based iLPC2AHB bridge
2. The LPC mailbox
3. An otherwise unused KCS device

In ASPEED BMC SoCs prior to the AST2600, the LPC mailbox required
configuration via the SuperIO device, which exposes the unrestricted iLPC2AHB
backdoor into the BMC's physical address space. The iLPC2AHB capability could
not be mitigated without disabling SuperIO support entirely, and so the
ability to use the mailbox went with it. This security issue is resolved in
the AST2600 design, so the mailbox could be used in the Power10 platforms,
but we have lower-complexity alternatives for generating an IRQ on the BMC.
We could use the iLPC2AHB from the host to drive one of the watchdogs in the
BMC to trigger a reset, but this exposes a stability risk due to the
unrestricted power of the interface, let alone the security implications, and
like the mailbox it is more complex than the alternatives.

This draws us towards the use of a KCS device, which is best aligned with the
simple need of generating an IRQ on the BMC. The AST2600 has at least four
KCS devices, of which one is already in use for IBM's vendor-defined MCTP LPC
binding, leaving at least three from which to choose.
## Proposed Design

The proposed design is for a simple daemon, started at BMC boot, that invokes
the desired crash dump handler according to the system policy upon receiving
the external signal. The implementation should have no IPC dependencies or
interactions with `init`, as the reason for invoking the recovery mechanism
is unknown and any of these interfaces might be unresponsive.

A trivial implementation of the daemon is:

```sh
# Block until the host writes a byte to the recovery device at $path
dd if=$path bs=1 count=1
# Trigger a kernel crash to capture debug data and reboot the BMC
echo c > /proc/sysrq-trigger
```

For systems with kdump enabled, this will result in a kernel crash dump being
collected and the BMC being rebooted.

A more elegant implementation might be to invoke `kexec` directly, but this
requires that the necessary support is already available on the platform.

Other activities in userspace might be feasible if it can be assumed that
whatever failure has occurred will not prevent debug data collection, but no
statement about this can be made in general.

### An Idealised KCS-based Protocol for Power10 Platforms

The proposed implementation provides for both the required and desired
behaviours outlined in the requirements section above.

The host and BMC protocol operates as follows, starting with the BMC
application invoked during the boot process:

1. Set the `Ready` bit in STR

2. Wait for an `IBF` interrupt

3. Read `IDR`. The hardware clears `IBF` as a result

4. If the read value is 0x44 (`D` for "Debug") then execute the debug dump
   collection process and reboot. Otherwise,

5. Go to step 2.

On the host:

1. If the `Ready` bit in STR is clear, escalate to recovery mechanism 3.
   Otherwise,

2. If the `IBF` bit in STR is set, escalate to recovery mechanism 3.
   Otherwise,

3. Start an escalation timer

4. Write 0x44 (`D` for "Debug") to the Input Data Register (IDR). The
   hardware sets `IBF` as a result

5. If `IBF` clears before expiry, restart the escalation timer

6. If an STR read generates an LPC SYNC No Response abort, or `Ready` clears
   before expiry, restart the escalation timer

7. If `Ready` becomes set before expiry, disarm the escalation timer.
   Recovery is complete. Otherwise,

8. Escalate to recovery mechanism 3 if the escalation timer expires at any
   point

A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
implementation is not required to emit one, and the host implementation must
behave correctly without one. Recovery is only necessary if other paths have
failed, so STR can be read by the host when it decides recovery is required,
and read by time-based polling thereafter.

The host must be prepared to handle LPC SYNC errors when accessing the KCS
device IO addresses, particularly "No Response" aborts. It is not guaranteed
that the KCS device will remain available during BMC resets.

As STR is polled by the host, it's not necessary for the BMC to write to ODR.
The protocol only requires the host to write to IDR and to periodically poll
STR for changes to the `IBF` and `Ready` state. This removes bi-directional
dependencies.

The uni-directional writes and the lack of SerIRQ reduce the features
required for correct operation of the protocol, and thus the surface area for
failure of the recovery path.

The layout of the KCS Status Register (STR) is as follows:

| Bit | Owner    | Definition               |
|-----|----------|--------------------------|
| 7   | Software |                          |
| 6   | Software |                          |
| 5   | Software |                          |
| 4   | Software | Ready                    |
| 3   | Hardware | Command / Data           |
| 2   | Software |                          |
| 1   | Hardware | Input Buffer Full (IBF)  |
| 0   | Hardware | Output Buffer Full (OBF) |

### A Real-World Implementation of the KCS Protocol for Power10 Platforms

Implementing the protocol described above in userspace is challenging due to
the available kernel interfaces[1], and implementing the behaviour in the
kernel falls afoul of the de facto "mechanism, not policy" rule of kernel
support.

Realistically, on the host side the only requirements are the use of a timer
and writing the appropriate value to the Input Data Register (IDR). All the
proposed status bits can be ignored. With this in mind, the BMC's
implementation can be reduced to reading an appropriate value from IDR.
Reducing the requirements on the BMC's behaviour in this way allows the use
of the `serio_raw` driver (which has the restriction that userspace can't
access the status value).

[1] https://lore.kernel.org/lkml/37e75b07-a5c6-422f-84b3-54f2bea0b917@www.fastmail.com/
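Under that reduced model, the trivial daemon sketched earlier only grows
slightly. The following is a hypothetical end-to-end sketch, assuming the KCS
channel enumerates as `serio0` and is exposed via `serio_raw`; the port name
and device node are placeholders that depend on the platform and probe order:

```sh
#!/bin/sh
# Hypothetical sketch: bind the KCS serio port to serio_raw, then wait for
# the protocol's debug command before triggering collection.
echo -n serio_raw > /sys/bus/serio/devices/serio0/drvctl

while byte="$(dd if=/dev/serio_raw0 bs=1 count=1 2>/dev/null | od -An -tx1)"; do
	# A failed or empty read suggests the device has gone away
	[ -n "$byte" ] || exit 1
	case "$byte" in
	*44) # 0x44, 'D' for "Debug", per the idealised protocol above
		echo c > /proc/sysrq-trigger
		;;
	esac
done
```

A concrete implementation of the daemon is referenced in the next section.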
### Prototype Implementation Supporting Power10 Platforms

A concrete implementation of the proposal's userspace daemon is available on
GitHub:

https://github.com/amboar/debug-trigger/

Deployment requires additional kernel support in the form of the patches at
[2].

[2] https://github.com/amboar/linux/compare/2dbb5aeba6e55e2a97e150f8371ffc1cc4d18180...for/openbmc/kcs-raw

## Alternatives Considered

See the discussion in Background.

## Impacts

The proposal has some security implications. The mechanism provides an
unauthenticated means for the host firmware to crash and/or reboot the BMC,
which can itself become a concern for stability and availability. Use of this
feature requires that the host firmware is trusted; that is, the host and BMC
firmware must be in the same trust domain. If a platform concept requires
that the BMC and host firmware remain in disjoint trust domains then this
feature must not be provided by the BMC.

As the feature might produce surprising system behaviour, there is an impact
on documentation for systems deploying this design: the mechanism must be
documented in such a way that rebooting the BMC in these circumstances isn't
surprising.

Developers are impacted in the sense that they may have access to better
debug data than might otherwise be possible. There are no obvious
developer-specific drawbacks.

Due to simplicity being a design point of the proposal, there are no
significant API, performance or upgradability impacts.

## Testing

Generally, testing this feature requires complex interactions with the host
firmware and platform-specific mechanisms for triggering the reboot
behaviour.

For Power10 platforms this feature may be safely tested under QEMU by
scripting the monitor to inject values on the appropriate KCS device.
Implementing this for automated testing may need explicit support in CI.