# Design proposal for issuing NMI on servers that use OpenBMC

Author: Lakshminarayana Kammath

Other contributors: Jayanth Othayoth

Created: 2019-05-21

## Problem Description

Currently, servers that use OpenBMC cannot capture relevant debug data when the
host is unresponsive or hung. These systems need the ability to diagnose the
root cause of a hang and to recover the system with the debug data collected.

## Background and References

At customer sites and in labs, the host sometimes becomes unresponsive, causing
a system hang (https://github.com/ibm-openbmc/dev/issues/457). Today there is no
way to figure out what went wrong with a host in this hung state. One has to
recover the system with no relevant debug data captured.

Whether the host is unresponsive or still running, an admin needs to be able to
trigger an NMI event which, in turn, triggers an architecture-dependent
procedure that fires an interrupt on all the available processors on the system.

## Proposed Design for servers that use OpenBMC

This proposal aims to trigger an NMI, which in turn invokes an
architecture-specific procedure that enables data collection followed by
recovery of the host. This will enable host/OS development teams to analyze and
fix issues that cause host hangs and unresponsive systems.

### D-Bus

Introduce a new D-Bus interface in the Control.Host namespace
(/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml)
and implement a D-Bus back-end for each processor-specific target.

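As a rough illustration of how a client on the BMC might reach such a back-end,
the sketch below builds a busctl command line for the new interface. Only the
interface name comes from this proposal. The service name, object path, and
method name are assumptions chosen for illustration and may differ per platform.

```python
import subprocess

# Interface name taken from this proposal.
NMI_INTERFACE = "xyz.openbmc_project.Control.Host.NMI"

# Assumed for illustration: service name, object path, and method name.
NMI_SERVICE = "xyz.openbmc_project.Control.Host.NMI"
NMI_OBJECT_PATH = "/xyz/openbmc_project/control/host0/nmi"
NMI_METHOD = "NMI"


def build_nmi_command():
    """Return the busctl argv that calls the (assumed) NMI method."""
    return [
        "busctl",
        "call",
        NMI_SERVICE,
        NMI_OBJECT_PATH,
        NMI_INTERFACE,
        NMI_METHOD,
    ]


def trigger_nmi():
    """Run the command on the BMC; raises CalledProcessError on failure."""
    subprocess.run(build_nmi_command(), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe anywhere.
    print(" ".join(build_nmi_command()))
```

Keeping the command construction separate from its execution makes it possible
to inspect or log the call without touching a live host.
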
### BMC Support

Enable the NMI D-Bus phosphor interface and expose it via Redfish.

### Redfish Schema used

- Reference: DSP2046 2018.3
- The ComputerSystem 1.6.0 schema provides an action called
  #ComputerSystem.Reset, which is used to reset the system. The ResetType
  parameter indicates the type of reset to perform. In this case, we can use
  the Nmi type:
  - Nmi: Generate a Diagnostic Interrupt (usually an NMI on x86 systems) to
    cease normal operations, perform diagnostic actions and typically halt the
    system.

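A minimal sketch of what a client request for this action could look like is
shown below. It only constructs the HTTP request and does not send it. The host
name, the system id `system`, and the exact action URI are assumptions for
illustration, and authentication is left out.

```python
import json
import urllib.request


def build_nmi_request(bmc_host, system_id="system"):
    """Build (but do not send) a POST for ComputerSystem.Reset with Nmi.

    bmc_host and system_id are placeholders; the URI layout is an assumption.
    """
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    # ResetType is the action parameter described by the schema.
    body = json.dumps({"ResetType": "Nmi"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_nmi_request("bmc.example.com")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

A real client would add credentials or a session token and pass the request to
`urllib.request.urlopen` with a TLS context suited to the BMC's certificate.
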
## High-Level Flow

1. The host/OS is hung or unresponsive, or one needs to take a kernel dump to
   debug some error condition.
2. The admin/user sends a POST to the Redfish ComputerSystem.Reset action URI
   with the Action and ResetType properties set to
   {"Action":"ComputerSystem.Reset","ResetType":"Nmi"} to trigger the NMI.
3. The Redfish handler invokes a D-Bus NMI back-end call, which uses an
   arch-specific back-end implementation of xyz.openbmc_project.Control.Host.NMI
   to trigger an NMI on all the processors on the system.
4. On receiving the NMI, the host automatically invokes architecture-specific
   actions. One such action could be invoking kdump followed by a reboot.

- Note: An NMI can be sent to the host in any state, not just an unresponsive
  state.

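The hand-off in steps 2 and 3 can be sketched as a small dispatcher that maps a
reset request onto a D-Bus method call. This is an illustration under stated
assumptions, not the actual Redfish server code: `handle_reset`, the injected
`dbus_call` callable, and the service name, object path, and method name are all
hypothetical. Only the interface name is taken from this proposal.

```python
# (service, object path, interface, method). Only the interface name comes from
# the proposal; the other three values are assumed for illustration.
NMI_TARGET = (
    "xyz.openbmc_project.Control.Host.NMI",
    "/xyz/openbmc_project/control/host0/nmi",
    "xyz.openbmc_project.Control.Host.NMI",
    "NMI",
)


def handle_reset(payload, dbus_call):
    """Forward an Nmi reset request to the D-Bus back-end.

    payload is the decoded JSON body of the reset action.
    dbus_call is any callable taking (service, path, interface, method).
    """
    reset_type = payload.get("ResetType")
    if reset_type != "Nmi":
        # Other reset types are outside the scope of this sketch.
        raise ValueError(f"unsupported ResetType in this sketch: {reset_type!r}")
    # No host state check here: the NMI may be sent in any host state.
    dbus_call(*NMI_TARGET)


if __name__ == "__main__":
    recorded = []
    handle_reset({"ResetType": "Nmi"}, lambda *args: recorded.append(args))
    print(recorded)
```

Injecting the D-Bus call keeps the dispatch logic testable without a BMC, and
leaves the architecture-specific work entirely to the back-end behind the
interface.
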
## Alternatives Considered

Extend the existing D-Bus interface in the State.Host namespace
(/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml)
to support a new RequestedHostTransition value called Nmi. The D-Bus back-end
could then internally invoke the processor-specific target to issue the NMI and
perform associated actions.

There were strong reasons to move away from the above approach.
phosphor-state-manager has always been focused on the states of the BMC,
Chassis, and Host. An NMI is more of an action against the host than a state.

## Impacts

This implementation changes system state only when an NMI is initiated,
irrespective of the state the host OS is in, so it has minimal impact on the
rest of the system.

## Testing

Depending on the platform hardware design, this test requires a host OS kernel
module driver to create a hard lockup/hang and then verify that the NMI is
handled correctly. Also, one can invoke an NMI to get a crash dump and confirm
via console logs that the host received the NMI.