# Design proposal for issuing NMI on servers that use OpenBMC

Author: Lakshminarayana Kammath

Other contributors: Jayanth Othayoth

Created: 2019-05-21

## Problem Description

Currently, servers that use OpenBMC cannot capture relevant debug data when the
host is unresponsive or hung. These systems need the ability to diagnose the
root cause of a hang and to perform recovery using the debug data collected.

## Background and References

At customer sites and in labs, the host sometimes becomes unresponsive, causing
a system hang (https://github.com/ibm-openbmc/dev/issues/457). In that state
there is no way to figure out what went wrong with the host, and one has to
recover the system with no relevant debug data captured.

Whether the host is unresponsive or still running, the admin needs a way to
trigger an NMI event, which in turn triggers an architecture-dependent
procedure that fires an interrupt on all the available processors in the
system.

## Proposed Design for servers that use OpenBMC

This proposal aims to trigger an NMI, which in turn invokes an
architecture-specific procedure that enables data collection followed by
recovery of the host. This will enable host/OS development teams to analyze and
fix issues that cause the host to hang or become unresponsive.

### D-Bus

Introduce a new D-Bus interface in the Control.Host namespace
(/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml)
and implement the D-Bus back-end for the respective processor-specific targets.

### BMC Support

Enable the NMI D-Bus phosphor interface and support it via Redfish.

### Redfish Schema used

- Reference: DSP2046 2018.3
- The ComputerSystem 1.6.0 schema provides an action called
  #ComputerSystem.Reset, which is used to reset the system. The ResetType
  parameter indicates the type of reset to perform. In this case, we can use
  the Nmi type:
  - Nmi: Generate a Diagnostic Interrupt (usually an NMI on x86 systems) to
    cease normal operations, perform diagnostic actions and typically halt the
    system.
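
As an illustration of the Reset action described above, the following Python
sketch builds (but does not send) the POST request. It is a minimal sketch under
stated assumptions, not the implementation: the BMC host name and the system ID
`system` are placeholders, the helper name `build_nmi_request` is hypothetical,
the action path follows the standard Redfish layout
`/redfish/v1/Systems/<id>/Actions/ComputerSystem.Reset`, and authentication and
TLS handling are left out. The body carries only the ResetType parameter.

```python
import json
import urllib.request

# Assumption: "system" is a placeholder ComputerSystem ID; substitute the ID
# that your BMC actually exposes under /redfish/v1/Systems.
SYSTEM_ID = "system"


def build_nmi_request(bmc_host: str, system_id: str = SYSTEM_ID) -> urllib.request.Request:
    """Build (but do not send) a POST for the #ComputerSystem.Reset action."""
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    # ResetType "Nmi" asks the BMC to raise a diagnostic interrupt on the host.
    body = json.dumps({"ResetType": "Nmi"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical host name, used only to show the resulting request.
    req = build_nmi_request("bmc.example.com")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it. A real
client would also need to supply credentials or a session token and decide how
to validate the BMC certificate.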

## High-Level Flow

1. The host/OS is hung or unresponsive, or one needs to take a kernel dump to
   debug some error condition.
2. The admin/user sends a POST to the Redfish ComputerSystem.Reset action URI
   with the Action and ResetType properties set to
   {"Action":"ComputerSystem.Reset","ResetType":"Nmi"} to trigger the NMI.
3. The Redfish handler invokes the D-Bus NMI call, which uses an
   architecture-specific back-end implementation of
   xyz.openbmc_project.Control.Host.NMI to trigger an NMI on all the processors
   in the system.
4. On receiving the NMI, the host automatically invokes architecture-specific
   actions. One such action could be invoking kdump followed by a reboot.

- Note: An NMI can be sent to the host in any state, not just an unresponsive
  state.

## Alternatives Considered

Extend the existing D-Bus interface in the State.Host namespace
(/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml)
to support a new RequestedHostTransition value called Nmi. The D-Bus back-end
can internally invoke the processor-specific target to issue the NMI and perform
the associated actions.

There were strong reasons to move away from the above approach.
phosphor-state-manager has always been focused on the states of the BMC,
Chassis, and Host. An NMI is more of an action against the host than a state.

## Impacts

This implementation changes the system state only when an NMI is initiated,
irrespective of the state the host OS is in, so it has minimal impact on the
rest of the system.

## Testing

Depending on the platform hardware design, this test requires a host OS kernel
module driver to create a hard lockup/hang and then check that the scenario is
handled correctly. Also, one can invoke an NMI to get the crash dump and confirm
via console logs that the host received the NMI.
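
Before running the lockup test, it may be useful to confirm that the BMC
advertises Nmi as a supported reset type. The sketch below checks a
ComputerSystem resource body that has already been fetched and parsed. It
assumes the BMC publishes the standard `ResetType@Redfish.AllowableValues`
annotation inline under the `#ComputerSystem.Reset` action; if an
implementation advertises the allowed values elsewhere, this check returns
False. The helper name and the sample payload are hypothetical.

```python
def nmi_reset_supported(computer_system: dict) -> bool:
    """Return True if the ComputerSystem resource lists Nmi as a ResetType."""
    action = computer_system.get("Actions", {}).get("#ComputerSystem.Reset", {})
    # Assumption: allowable values are published inline on the action object.
    allowed = action.get("ResetType@Redfish.AllowableValues", [])
    return "Nmi" in allowed


if __name__ == "__main__":
    # Hypothetical, trimmed-down resource body for illustration only.
    sample = {
        "Actions": {
            "#ComputerSystem.Reset": {
                "target": "/redfish/v1/Systems/system/Actions/ComputerSystem.Reset",
                "ResetType@Redfish.AllowableValues": ["On", "ForceOff", "Nmi"],
            }
        }
    }
    print(nmi_reset_supported(sample))
```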