# Design proposal for issuing NMI on servers that use OpenBMC

Author: Lakshminarayana Kammath

Primary assignee: Lakshminarayana Kammath

Other contributors: Jayanth Othayoth

Created: 2019-05-21

## Problem Description
Currently, servers that use OpenBMC have no way to capture relevant debug data
when the host is unresponsive or hung. These systems need the ability to
diagnose the root cause of a hang and to recover the host once the debug data
has been collected.

## Background and References
At customer sites and in labs, the host sometimes becomes unresponsive, causing
a system hang (https://github.com/ibm-openbmc/dev/issues/457).
Today there is no way to find out what went wrong with a host in a hung state;
one has to recover the system without capturing any relevant debug data.

Whenever the host is unresponsive (or even while it is still running), the
admin needs to be able to trigger an NMI (non-maskable interrupt) event, which
in turn triggers an architecture-dependent procedure that fires an interrupt on
all the available processors on the system.

## Proposed Design for servers that use OpenBMC
This proposal aims to trigger an NMI, which in turn invokes an
architecture-specific procedure that enables data collection followed by
recovery of the host. This will enable host/OS development teams to analyze and
fix issues that cause a host hang or an unresponsive system.

### D-Bus
Introduce a new D-Bus interface in the Control.Host namespace
(/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml)
and implement the new D-Bus back-end for the respective processor-specific
targets.
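
As a rough illustration of the split between the common interface and the
processor-specific back-ends, the Python sketch below models a single NMI entry
point behind a per-architecture registry. Only the interface name comes from
this proposal; the class names, the `example-arch` key, and the `pulse_line`
callback are hypothetical, and a real back-end would be a D-Bus service rather
than an in-process object.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type

# Interface name taken from this proposal.
NMI_INTERFACE = "xyz.openbmc_project.Control.Host.NMI"


class NmiBackend(ABC):
    """Hypothetical base class: one entry point, no arguments, no state."""

    @abstractmethod
    def nmi(self) -> None:
        """Raise an NMI on all host processors."""


class GpioNmiBackend(NmiBackend):
    """Hypothetical back-end that pulses a platform-specific line."""

    def __init__(self, pulse_line: Callable[[], None]) -> None:
        self._pulse_line = pulse_line

    def nmi(self) -> None:
        self._pulse_line()


# Hypothetical registry: architecture name -> back-end class.
BACKENDS: Dict[str, Type[NmiBackend]] = {"example-arch": GpioNmiBackend}


def make_backend(arch: str, *args) -> NmiBackend:
    """Pick the back-end for the given architecture or fail loudly."""
    try:
        return BACKENDS[arch](*args)
    except KeyError:
        raise ValueError(f"no NMI back-end registered for {arch!r}") from None
```

Keeping the entry point free of arguments and state reflects the idea that an
NMI is an action against the host, not a state to be tracked.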

### BMC Support
Enable the NMI D-Bus phosphor interface and expose it via Redfish.

### Redfish Schema used
* Reference: DSP2046 2018.3
* The ComputerSystem 1.6.0 schema provides an action called
  #ComputerSystem.Reset, which is used to reset the system.
  The ResetType parameter indicates the type of reset to perform.
  In this case, we can use the Nmi type:
    * Nmi: Generate a Diagnostic Interrupt (usually an NMI on x86 systems)
      to cease normal operations, perform diagnostic actions and typically
      halt the system.
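
A request for this action could be built as in the Python sketch below. The BMC
address and the `/redfish/v1/Systems/system` resource path are assumptions for
illustration (the actual path depends on the implementation), and the payload
is the one given in the High-Level Flow of this proposal. Nothing is sent over
the network here.

```python
import json
import urllib.request

# Assumptions for illustration only: the BMC address and the system resource
# path are hypothetical; the payload follows the High-Level Flow section.
BMC_URL = "https://bmc.example.com"
RESET_URI = "/redfish/v1/Systems/system/Actions/ComputerSystem.Reset"


def build_nmi_request(base_url: str = BMC_URL) -> urllib.request.Request:
    """Build (but do not send) the POST that asks for a diagnostic interrupt."""
    payload = {"Action": "ComputerSystem.Reset", "ResetType": "Nmi"}
    return urllib.request.Request(
        url=base_url + RESET_URI,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_nmi_request()
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Actually sending the request would also need authentication and TLS handling,
which are left out of this sketch.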

## High-Level Flow
1. The host/OS is hung or unresponsive, or one needs to take a kernel dump
   to debug some error condition.
2. The admin/user sends a POST request to the Redfish ComputerSystem.Reset
   action URI with the Action and ResetType properties set to
   {"Action":"ComputerSystem.Reset","ResetType":"Nmi"} to trigger the NMI.
3. The Redfish handler invokes a D-Bus NMI back-end call, which uses an
   arch-specific back-end implementation of xyz.openbmc_project.Control.Host.NMI
   to trigger an NMI on all the processors on the system.
4. On receiving the NMI, the host automatically invokes architecture-specific
   actions. One such action could be invoking kdump followed by a reboot.

* Note: An NMI can be sent to the host in any state, not just when it is
  unresponsive.
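
Steps 2 and 3 can be sketched as a small dispatch function. This is a minimal
Python model under stated assumptions, not the actual Redfish server code:
`handle_reset`, the `trigger_nmi` callback, and the 204/400 status codes are
hypothetical, and the callback stands in for the D-Bus call into the
arch-specific back-end.

```python
import json
from typing import Callable


def handle_reset(body: bytes, trigger_nmi: Callable[[], None]) -> int:
    """Hypothetical Redfish-side handler; returns an HTTP-style status code.

    trigger_nmi stands in for the D-Bus call into the back-end that
    implements xyz.openbmc_project.Control.Host.NMI (step 3).
    """
    try:
        params = json.loads(body)
    except ValueError:
        return 400  # malformed request body
    if not isinstance(params, dict) or params.get("ResetType") != "Nmi":
        return 400  # this sketch only covers the Nmi reset type
    trigger_nmi()  # the host reacts on its own afterwards (step 4)
    return 204


sent = []
status = handle_reset(
    b'{"Action":"ComputerSystem.Reset","ResetType":"Nmi"}',
    lambda: sent.append("NMI"),
)
print(status, sent)
```

The handler does not check the host state before calling the back-end, which
matches the note that an NMI can be sent to the host in any state.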

## Alternatives Considered
Extend the existing D-Bus interface in the State.Host namespace
(/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml)
to support a new RequestedHostTransition value called Nmi.
The D-Bus back-end could then internally invoke a processor-specific target to
issue the NMI and perform the associated actions.

There were strong reasons to move away from this approach:
phosphor-state-manager has always been focused on the states of the BMC,
chassis, and host, whereas an NMI is an action against the host rather than
a state.

## Impacts
This implementation only needs to make some changes to the system state when
an NMI is initiated, irrespective of the state the host OS is in, so it has
minimal impact on the rest of the system.

## Testing
Depending on the platform hardware design, this test requires a host OS kernel
module driver to create a hard lockup/hang and then check that the NMI is
handled as expected. Also, one can invoke an NMI to get the crash dump and
confirm via the console logs that the host received the NMI.