# Memory preserving reboot and System Dump extraction flow on POWER Systems.

Author: Dhruvaraj S <dhruvaraj@in.ibm.com>

Created: 11/06/2019

## Problem Description

On POWER based servers, hypervisor firmware manages and allocates
resources to the logical partitions running on the server. If this hypervisor
encounters an error and cannot continue with management operations, the server
needs to be restarted. A typical server reboot erases the contents of main
memory, including the current running configuration of the logical partitions
and the data required for debugging the fault. Some hypervisors on POWER
based systems don't have access to non-volatile storage to store this
content after a failure. On these servers, a warm reboot that preserves main
memory is needed to create the memory dump required for debugging. This
document explains the high-level flow of the warm reboot and the extraction of
the resulting dump from the hypervisor memory.

## Glossary

- **Boot**: The process of initializing hardware components in a computer system
and loading the operating system.

- **Hostboot**: The firmware that runs on the host processors and performs all
processor, bus, and memory initialization on POWER based servers.
[read more](https://github.com/open-power/docs/blob/master/hostboot/HostBoot_PG.md)

- **Self Boot Engine (SBE)**: A microcontroller built into the host processors
of POWER systems to assist in initializing the processor during boot.
It also acts as an entry point for several hardware access operations to the
processor. [read more](https://sched.co/SPZP)

- **Master Processor**: The processor that is initialized first to execute
boot firmware.

- **POWER Hardware Abstraction Layer (PHAL)**: A software component on the BMC
providing access to the POWER hardware.

- **Hypervisor**: A hypervisor (or virtual machine monitor, VMM) is computer
software, firmware, or hardware that creates and runs virtual machines.
[read more](https://en.wikipedia.org/wiki/Hypervisor)

- **System Dump**: A dump of main memory and hardware states for debugging
faults in the hypervisor.

- **Memory Preserving Reboot (MPR)**: A reboot method that preserves the
contents of volatile memory.

- **Terminate Immediate (TI)**: A condition in which the hypervisor has
encountered a fatal error and cannot continue with normal operations.

- **Attention**: The signal generated by the hardware or the firmware for
a specific event.

- **Redfish**: The Redfish standard is a suite of specifications that deliver
an industry-standard protocol providing a RESTful interface for the management
of servers, storage, networking, and converged infrastructure.
[Read More](https://en.wikipedia.org/wiki/Redfish_(specification))

- **OCC**: An On-Chip Controller (OCC) is a co-processor that is embedded
directly on the die of POWER processors. The OCC can be used to control
the processor frequency, power consumption, and temperature to maximize
performance and minimize energy usage.
[Read More](https://openpowerfoundation.org/on-chip-controller-occ/)

- **Checkstop**: A severe error inside a processor core that causes the
core to stop all processing activities.

- **PNOR**: PNOR is a host NOR flash where the firmware is stored.

## Background and References
When a POWER based server encounters a fault and needs a restart,
it alerts the BMC to initiate a memory preserving reboot. The BMC starts the
reboot by informing the SBE on each of the processors. Each SBE stops the
running cores, collects the hardware states in a specified format, and stores
them in the host memory. Once the data is collected, the SBE returns control to
the BMC. The BMC then initiates a memory preserving reboot. Once the system has
finished booting, the hypervisor collects the hardware data and memory contents
to create a dump file in the host memory.

## Requirements

### Primary Requirements

-   The system dump should be collected irrespective of the availability of an
    external entity to offload it at the time of a failure.

-   The system should provide a mechanism for the user to request a system dump.

-   The server should boot back to runtime.

-   The hypervisor should send a special attention to the BMC to notify it
    about a severe fault.

-   The BMC should receive the special TI attention from the hypervisor.

-   The BMC should change the host state to 'DiagnosticMode'.

-   The BMC should inform the SBE to start the memory preserving reboot and
    collect the hardware data.

-   The error log associated with the dump needs to be part of the dump package.

-   A dump summary should be created with the size and other details of the
    dump.

-   Once the dump is generated, the hypervisor should notify the BMC.

-   The hypervisor should offload the dump to the BMC for transfer to an
    external client.

-   Redfish interfaces should be provided to manage the dump.

-   A tool should be available to collect the dump from the server.

-   A method should be available to parse the content of the dump.

## Proposed Design

### The flow
The following diagram shows the flow of the memory preserving reboot and
system dump offloading.

![Memory preserving reboot and dump extraction flow](https://user-images.githubusercontent.com/16666879/77680635-40347000-6fba-11ea-8957-8f7fbc93f57e.jpeg)

#### 1 - Server fault and notification to BMC
When there is a fault, the hypervisor generates an attention. The attention
listener on the BMC detects the attention. In the case of OpenPOWER based Linux
systems, an additional s0 interrupt is sent to the SBE to stop the cores
immediately.

#### 2 - Analyze the error data.
The attention listener on the BMC calls a chip-op to analyze the reason for the
attention.

#### 3 - Initiate System Dump
The attention listener on the BMC sets the diagnostic target for reboot to
initiate a memory preserving reboot.

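Steps 1 to 3 can be sketched as a small dispatch function. This is an illustration only: the `AttentionReason` values, the `handle_attention` helper, and the `"diagnostic-mode"` target name are hypothetical and are not part of any existing BMC interface.

```python
from enum import Enum, auto


class AttentionReason(Enum):
    """Hypothetical attention categories used only by this sketch."""
    TERMINATE_IMMEDIATE = auto()
    CHECKSTOP = auto()
    OTHER = auto()


def handle_attention(analyze, request_target):
    """Analyze an attention and request the diagnostic target for a TI.

    `analyze` stands in for the chip-op that reports why the attention was
    raised; `request_target` stands in for whatever starts a reboot target.
    Both are placeholders supplied by the caller.
    """
    reason = analyze()
    if reason is AttentionReason.TERMINATE_IMMEDIATE:
        request_target("diagnostic-mode")  # hypothetical target name
        return True
    return False


requested = []
started = handle_attention(
    lambda: AttentionReason.TERMINATE_IMMEDIATE, requested.append
)
print(started, requested)
```

Passing the analysis and the target request in as callables keeps the decision logic testable without hardware.
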
#### 4 - Initiate Memory preserve transition
The following steps are executed as part of the reboot target:

- Set the system state to DiagnosticMode
- Stop OCC
- Disable checkstop monitoring
- Issue enter_mpipl chip-op to each SBE

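The ordering of the step 4 actions can be sketched as follows. The `bmc` and SBE objects are duck-typed stand-ins, and every method name on them (`set_host_state`, `stop_occ`, `disable_checkstop_monitoring`, `chip_op`) is an assumption made for this example; only the `enter_mpipl` chip-op name and the `DiagnosticMode` state come from the design.

```python
class Recorder:
    """Test double that records every call as (name, method, args)."""

    def __init__(self, name, log):
        self._name = name
        self._log = log

    def __getattr__(self, method):
        # Only reached for attributes that are not set in __init__.
        def call(*args):
            self._log.append((self._name, method, args))
        return call


def enter_memory_preserve(bmc, sbes):
    """Run the step 4 actions in the order listed in the design."""
    bmc.set_host_state("DiagnosticMode")
    bmc.stop_occ()
    bmc.disable_checkstop_monitoring()
    for sbe in sbes:
        sbe.chip_op("enter_mpipl")  # one chip-op per SBE


log = []
bmc = Recorder("bmc", log)
sbes = [Recorder(f"sbe{i}", log) for i in range(2)]
enter_memory_preserve(bmc, sbes)
for entry in log:
    print(entry)
```
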
#### 5 - SBE collects the hardware data
Each SBE collects the architected states and stores them in a pre-defined
location.

#### 6 - BMC Start warm boot
Once the SBE finishes the hardware collection, the BMC does the following to
boot the system while preserving the memory:

- Reset VPNOR
- Enable watchdog
- Enable checkstop monitoring
- Run istep proc_select_boot_master
- Run istep sbe_config_update
- Issue continue_mpipl chip-op instead of start_cbs on the
  master processor

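The warm boot sequence in step 6 can be modelled as an ordered list run by a small driver. This is a sketch under assumptions: the step strings are labels chosen for this example, `run_steps` is a hypothetical helper, and stopping at the first failed step is an assumed policy that the design does not specify.

```python
# Labels for the step 6 actions, in the order given in the design.
MPR_WARM_BOOT_STEPS = [
    "reset_vpnor",
    "enable_watchdog",
    "enable_checkstop_monitoring",
    "istep proc_select_boot_master",
    "istep sbe_config_update",
    "chip-op continue_mpipl on master processor",
]


def run_steps(steps, execute):
    """Run steps in order and stop at the first one that fails.

    `execute` is a caller-supplied callable that returns a truthy value
    on success. Returns (completed_steps, failed_step_or_None).
    """
    completed = []
    for step in steps:
        if not execute(step):
            return completed, step
        completed.append(step)
    return completed, None


completed, failed = run_steps(MPR_WARM_BOOT_STEPS, lambda step: True)
print(len(completed), failed)
```
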
#### 7 - Hostboot booting
Once the SBE is started, it starts Hostboot. Hostboot copies the architected
states to the right location and moves the memory contents to create the dump.

#### 8 - Hypervisor Formats dump and sends notification to BMC
Once the hypervisor is started, it formats the dump and sends a notification,
including the dump size, to the BMC through PLDM. PLDM calls the dump manager
interface to report the new dump. The dump manager creates a D-Bus object for
the new dump, with the status "not offloaded" and the dump size.
BMCWeb catches the object creation signal and notifies the Hardware Management
Console (HMC).

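The bookkeeping in step 8 can be sketched with an in-memory stand-in for the dump manager. The `DumpEntry` fields, the `DumpManager` class, and the `notify_new_dump` method are hypothetical names for this example; they do not describe the actual D-Bus interface.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class DumpEntry:
    entry_id: int        # id assigned by this sketch's manager
    source_dump_id: int  # id reported by the hypervisor
    size: int            # dump size in bytes
    offloaded: bool = False  # a new dump starts as "not offloaded"


class DumpManager:
    """In-memory stand-in for the BMC dump manager."""

    def __init__(self):
        self._ids = count(1)
        self.entries = {}
        self.listeners = []  # callbacks fired when an entry is created

    def notify_new_dump(self, source_dump_id, size):
        entry = DumpEntry(next(self._ids), source_dump_id, size)
        self.entries[entry.entry_id] = entry
        for listener in self.listeners:
            listener(entry)  # plays the role of the object creation signal
        return entry


seen = []
manager = DumpManager()
manager.listeners.append(seen.append)
entry = manager.notify_new_dump(source_dump_id=0x1000, size=4096)
print(entry)
```

The listener list models the path by which BMCWeb learns about a new dump and can then notify the HMC.
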
#### 9 - HMC send request to dump offload
Once the HMC is ready to offload, it creates an NBD server and sends a dump
offload request to the BMC. BMCWeb creates an NBD client and an NBD proxy to
offload the dump. The BMC dump manager makes a PLDM call with the dump id
provided by the hypervisor and the NBD device id. PLDM sends the offload
request to the hypervisor with the dump id.

#### 10 - Hypervisor starts dump offload
The hypervisor starts sending down the dump packets through DMA.
PLDM reads the dump and writes it to the NBD client endpoint.
The data reaches the NBD server on the HMC and is written to a dump file.

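The read-and-forward loop in step 10 amounts to a chunked copy from a source to a sink. The sketch below uses in-memory byte streams in place of the DMA source and the NBD client endpoint; the `offload` helper and the 64 KiB chunk size are assumptions chosen for illustration.

```python
import io


def offload(source, sink, chunk_size=64 * 1024):
    """Copy `source` to `sink` in fixed-size chunks; return bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        sink.write(chunk)
        total += len(chunk)
    return total


dump = io.BytesIO(b"\xAB" * 200_000)  # stands in for the dump packets
nbd_endpoint = io.BytesIO()           # stands in for the NBD client endpoint
copied = offload(dump, nbd_endpoint)
print(copied)
```

Copying in bounded chunks keeps memory use independent of the dump size, which matters because a system dump can be much larger than the memory available on the BMC.
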
#### 11 - Hypervisor sends down offload complete message
The hypervisor sends an offload complete message to the BMC, and the BMC
forwards it to the HMC. The NBD endpoints are cleared.

#### 12 - HMC verifies dump and send dump DELETE to BMC.
The HMC verifies the dump and sends a dump delete request to the BMC.
The BMC sends the dump delete message to the hypervisor.
The hypervisor deletes the dump in host memory.

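Steps 8 to 12 imply a simple lifecycle for a host dump. The sketch below encodes it as a transition table. The state names other than "not offloaded" and the rule that a dump is deleted only after it has been offloaded are assumptions of this example, not requirements stated in the design.

```python
from enum import Enum


class DumpState(Enum):
    NOT_OFFLOADED = "not offloaded"
    OFFLOADING = "offloading"
    OFFLOADED = "offloaded"
    DELETED = "deleted"


# Allowed transitions, following the order of steps 8 to 12.
ALLOWED = {
    DumpState.NOT_OFFLOADED: {DumpState.OFFLOADING},
    DumpState.OFFLOADING: {DumpState.OFFLOADED},
    DumpState.OFFLOADED: {DumpState.DELETED},
    DumpState.DELETED: set(),
}


def advance(current, target):
    """Return `target` if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = DumpState.NOT_OFFLOADED
for nxt in (DumpState.OFFLOADING, DumpState.OFFLOADED, DumpState.DELETED):
    state = advance(state, nxt)
print(state.value)
```
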
### Memory preserve reboot sequence.
![Memory preserve reboot sequence](https://user-images.githubusercontent.com/16666879/77681484-64448100-6fbb-11ea-94b4-9f2256241b1c.jpeg)

### Dump offload sequence
![Dump offload sequence](https://user-images.githubusercontent.com/16666879/77681614-9e158780-6fbb-11ea-8fac-fbcffd563bef.jpeg)

## Alternatives Considered
One alternative is to offload the dump from the hypervisor directly to an
external dump collection application instead of offloading through the BMC.
Offloading through the BMC was selected for the following reasons:

- The BMC provides a common point for offloading all dumps.
- During prototyping, offloading through the BMC gave better performance.
- Offloading through the BMC has less development impact on the host.

## Impacts
- PLDM on BMC and Host - Extensions to the PLDM implementation to pass the type
  of dump and the notification of a new dump file to the dump manager.
  [PLDM Design](https://github.com/openbmc/docs/blob/7c8847e95203ebfaeddb82a6e1b9888bc6a11311/designs/pldm-stack.md)

- Dump manager on BMC - The BMC dump manager supports dumps stored on the BMC
  and needs to be expanded to support host dumps.

- The external dump offloading application needs to support NBD based offload.

- A new Redfish schema is proposed for dump operations. [Redfish Dump Proposal](https://lists.ozlabs.org/pipermail/openbmc/2019-December/019827.html)

- BMCWeb needs to implement the new Redfish specification for dump.

- Add support to openpower-hw-diags to catch the special attention and initiate
  a memory preserving reboot.

- SBE needs to support a new operation to analyze the attention received
  from the host. The interface update is yet to be published.

## Testing
- Unit test plans
    - Test dump manager interfaces using busctl
    - Test reboot by setting the diag mode target
    - Test the SBE chip-ops using standalone calls
    - Test PLDM by using hypervisor debug commands
    - Test BMCWeb interfaces using curl

- Integration testing
    - User-initiated dump testing, which invokes a memory preserving reboot
      to collect the dump.
    - Initiate a memory preserving reboot by injecting a host error.
    - Offload the dump collected in the host.

- System Dump test plan
    - Automated tests to initiate and offload the dump as part of the test
      bucket.
    - Both user-initiated and error injection paths should be attempted.
