# Design proposal for issuing NMI on servers that use OpenBMC

Author: Lakshminarayana Kammath

Other contributors: Jayanth Othayoth

Created: 2019-05-21

## Problem Description

Currently, servers that use OpenBMC cannot capture relevant debug data when the
host is unresponsive or hung. These systems need the ability to diagnose the
root cause of a hang and to perform recovery with the debug data collected.

## Background and References

There are situations at customer sites and in labs where the host becomes
unresponsive, causing a system hang
(https://github.com/ibm-openbmc/dev/issues/457). In that case there is no way to
figure out what went wrong with the host while it is hung, and the system has to
be recovered with no relevant debug data captured.

Whether the host is unresponsive or running, an admin needs to be able to
trigger an NMI event which, in turn, triggers an architecture-dependent
procedure that fires an interrupt on all the available processors on the system.

## Proposed Design for servers that use OpenBMC

This proposal aims to trigger an NMI, which in turn invokes an
architecture-specific procedure that enables data collection followed by
recovery of the host. This will enable host/OS development teams to analyze and
fix issues where they see a hung host or an unresponsive system.

### D-Bus

Introduce a new D-Bus interface in the Control.Host namespace
(/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml)
and implement a new D-Bus back-end for the respective processor-specific
targets.

### BMC Support

Enable the NMI D-Bus phosphor interface and support it via Redfish.

### Redfish Schema used

- Reference: DSP2046 2018.3
- The ComputerSystem 1.6.0 schema provides an action called
  #ComputerSystem.Reset. This action is used to reset the system. The ResetType
  parameter indicates the type of reset to be performed.
  In this case, the Nmi reset type can be used:
  - Nmi: Generate a Diagnostic Interrupt (usually an NMI on x86 systems) to
    cease normal operations, perform diagnostic actions and typically halt the
    system.

## High-Level Flow

1. The host/OS is hung or unresponsive, or one needs to take a kernel dump to
   debug some error condition.
2. The admin/user sends a POST to the Redfish ComputerSystem.Reset action URI
   with the Action and ResetType properties set to
   {"Action":"ComputerSystem.Reset","ResetType":"Nmi"} to trigger the NMI.
3. The Redfish handler invokes a D-Bus NMI back-end call, which uses an
   architecture-specific back-end implementation of
   xyz.openbmc_project.Control.Host.NMI to trigger an NMI on all the processors
   on the system.
4. On receiving the NMI, the host automatically invokes architecture-specific
   actions. One such action could be invoking kdump followed by a reboot.

- Note: An NMI can be sent to the host in any state, not just when it is
  unresponsive.

## Alternatives Considered

Extend the existing D-Bus interface in the State.Host namespace
(/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml)
to support a new RequestedHostTransition value called Nmi. The D-Bus back-end
could internally invoke a processor-specific target to issue the NMI and perform
the associated actions.

There were strong reasons to move away from this approach.
phosphor-state-manager has always been focused on the states of the BMC,
chassis, and host. An NMI is more of an action against the host than a state.

## Impacts

This implementation only needs to make some changes to the system state when an
NMI is initiated, irrespective of the state the host OS is in, so it has minimal
impact on the rest of the system.

## Testing

Depending on the platform hardware design, this test requires a host OS kernel
module driver to create a hard lockup/hang and then verify that the scenario is
handled correctly.
Also, one can invoke an NMI to get a crash dump and confirm via the console
logs that the host received the NMI.
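
For manual testing, the Redfish request from step 2 of the High-Level Flow
might be built as in the sketch below. This is an illustration under stated
assumptions, not part of the proposal: the helper name `build_nmi_request`, the
host name `bmc.example.com`, and the system ID `system` are hypothetical, and
the action URI follows the usual Redfish layout
(`/redfish/v1/Systems/<id>/Actions/ComputerSystem.Reset`). The payload is the
one quoted in the flow; the Redfish schema defines only `ResetType` as the
action parameter, so the `Action` key may be unnecessary on some
implementations.

```python
import json
import urllib.request


def build_nmi_request(bmc_host, system_id="system", token=None):
    """Build (but do not send) the Redfish POST that requests an NMI.

    bmc_host, system_id and token are illustrative parameters; adjust them
    to the BMC under test.
    """
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    # Payload as quoted in the High-Level Flow section.
    payload = {"Action": "ComputerSystem.Reset", "ResetType": "Nmi"}
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Redfish session token header; obtaining the token is out of scope.
        headers["X-Auth-Token"] = token
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    req = build_nmi_request("bmc.example.com")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned request to `urllib.request.urlopen` (with TLS settings
appropriate for the BMC) would send it; the host console log can then be
checked for the NMI as described above.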