# Memory preserving reboot and System Dump extraction flow on POWER Systems.

Author: Dhruvaraj S <dhruvaraj@in.ibm.com>

Primary Assignee: Dhruvaraj S

Created: 11/06/2019

## Problem Description

On POWER based servers, hypervisor firmware manages and allocates resources
to the logical partitions running on the server. If the hypervisor encounters
an error and cannot continue with management operations, the server needs to
be restarted. A typical server reboot erases the contents of main memory,
including the running configuration of the logical partitions and the data
required for debugging the fault. Some hypervisors on POWER based systems
don't have access to non-volatile storage in which to save this content after
a failure. On these servers, a warm reboot that preserves main memory is
needed to create the memory dump required for debugging. This document
explains the high-level flow of the warm reboot and the extraction of the
resulting dump from hypervisor memory.

## Glossary

- **Boot**: The process of initializing hardware components in a computer system
and loading the operating system.

- **Hostboot**: The firmware that runs on the host processors and performs all
processor, bus, and memory initialization on POWER based servers.
[read more](https://github.com/open-power/docs/blob/master/hostboot/HostBoot_PG.md)

- **Self Boot Engine (SBE)**: A microcontroller built into the host processors
of POWER systems to assist in initializing the processor during the boot.
It also acts as an entry point for several hardware access operations to the
processor. [read more](https://sched.co/SPZP)

- **Master Processor**: The processor that is initialized first to execute
boot firmware.

- **POWER Hardware Abstraction Layer (PHAL)**: A software component on the BMC
providing access to the POWER hardware.

- **Hypervisor**: A hypervisor (or virtual machine monitor, VMM) is computer
software, firmware, or hardware that creates and runs virtual machines.
[read more](https://en.wikipedia.org/wiki/Hypervisor)

- **System Dump**: A dump of main memory and hardware states for debugging
faults in the hypervisor.

- **Memory Preserving Reboot (MPR)**: A method of rebooting that preserves the
contents of the volatile memory.

- **Terminate Immediate (TI)**: A condition in which the hypervisor has
encountered a fatal error and cannot continue with normal operations.

- **Attention**: The signal generated by the hardware or the firmware for
a specific event.

- **Redfish**: The Redfish standard is a suite of specifications that deliver
an industry-standard protocol providing a RESTful interface for the management
of servers, storage, networking, and converged infrastructure.
[Read More](https://en.wikipedia.org/wiki/Redfish_(specification))

- **OCC**: An On-Chip Controller (OCC) is a co-processor that is embedded
directly on the die of POWER processors. The OCC can be used to control
the processor frequency, power consumption, and temperature to maximize
performance and minimize energy usage.
[Read More](https://openpowerfoundation.org/on-chip-controller-occ/)

- **Checkstop**: A severe error inside a processor core that causes a processor
core to stop all processing activities.

- **PNOR**: PNOR is a host NOR flash where the firmware is stored.

## Background and References
When a POWER based server encounters a fault and needs a restart, it alerts
the BMC to initiate a memory preserving reboot. The BMC starts the reboot by
informing the SBE on each of the processors. Each SBE stops the running cores,
collects the hardware states in a specified format, and stores them in host
memory. Once the data is collected, the SBE returns control to the BMC. The BMC
then initiates a memory preserving reboot. Once the system has finished
booting, the hypervisor collects the hardware data and memory contents to
create a dump file in host memory.

## Requirements

### Primary Requirements

-   A system dump should be collected irrespective of the availability of an
    external entity to offload it at the time of a failure.

-   A mechanism should be provided for the user to request a system dump.

-   The server should boot back to runtime.

-   The hypervisor should send a special attention to the BMC to notify it
    about a severe fault.

-   The BMC should receive the special TI attention from the hypervisor.

-   The BMC should change the host state to 'DiagnosticMode'.

-   The BMC should inform the SBE to start the memory preserving reboot and
    collect the hardware data.

-   The error log associated with the dump needs to be part of the dump
    package.

-   A dump summary should be created with the size and other details of the
    dump.

-   Once the dump is generated, the hypervisor should notify the BMC.

-   The hypervisor should offload the dump to the BMC for transfer to an
    external client.

-   Redfish interfaces should be provided to manage the dump.

-   A tool should be provided to collect the dump from the server.

-   A method should be provided to parse the content of the dump.

## Proposed Design

### The flow
The flow of the memory preserving reboot and system dump offloading:

![Memory preserving reboot and dump extraction flow](https://user-images.githubusercontent.com/16666879/77680635-40347000-6fba-11ea-8957-8f7fbc93f57e.jpeg)

#### 1 - Server fault and notification to BMC
When there is a fault, the hypervisor generates an attention. The attention
listener on the BMC detects the attention. In the case of OpenPOWER based Linux
systems, an additional s0 interrupt is sent to the SBE to stop the cores
immediately.

#### 2 - Analyze the error data
The attention listener on the BMC calls a chip-op to analyze the reason for the
attention.

#### 3 - Initiate System Dump
The attention listener on the BMC sets the Diagnostic target for reboot to
initiate a memory preserving reboot.

#### 4 - Initiate Memory preserve transition
The following steps are executed as part of the reboot target:

- Set the system state to DiagnosticMode
- Stop OCC
- Disable checkstop monitoring
- Issue enter_mpipl chip-op to each SBE
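
The ordering of these steps can be sketched as a small sequencer. This is a
minimal illustration, not the BMC implementation: `run_steps`,
`memory_preserve_transition`, the processor names, and the `actions` log are
hypothetical stand-ins for the real services, and only the step order is taken
from the list above.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order and return the names of those that completed.

    An exception raised by a step propagates, so later steps never run.
    """
    completed: List[str] = []
    for name, action in steps:
        action()
        completed.append(name)
    return completed


def memory_preserve_transition(processors: List[str],
                               actions: List[str]) -> List[Step]:
    """Build the ordered step list; every action here is a placeholder."""
    steps: List[Step] = [
        ("set_diagnostic_mode",
         lambda: actions.append("host state = DiagnosticMode")),
        ("stop_occ",
         lambda: actions.append("OCC stopped")),
        ("disable_checkstop_monitoring",
         lambda: actions.append("checkstop monitoring off")),
    ]
    # One enter_mpipl chip-op per SBE. The default argument binds the
    # current processor name so each lambda keeps its own value.
    for proc in processors:
        steps.append(
            (f"enter_mpipl:{proc}",
             lambda p=proc: actions.append(f"enter_mpipl -> SBE on {p}"))
        )
    return steps
```

Calling `run_steps(memory_preserve_transition(["proc0", "proc1"], log))` with
an empty list `log` runs the three preparation steps first and then one
`enter_mpipl` placeholder per processor.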

#### 5 - SBE collects the hardware data
Each SBE collects the architected states and stores them in a pre-defined
location.

#### 6 - BMC starts warm boot
Once the SBE finishes the hardware collection, the BMC does the following to
boot the system while preserving the memory:

- Reset VPNOR
- Enable watchdog
- Enable checkstop monitoring
- Run istep proc_select_boot_master
- Run istep sbe_config_update
- Issue continue_mpipl chip-op instead of start_cbs on the master processor
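
The key difference in this sequence is the chip-op issued on the master
processor. The sketch below only models that choice and the step order; the
names `WARM_BOOT_PREFIX`, `master_start_op`, and `warm_boot_plan` are
hypothetical, and how each step is actually invoked on the BMC is not covered.

```python
from typing import List

# Step labels follow the list above; they are plain strings here, not
# real commands.
WARM_BOOT_PREFIX: List[str] = [
    "reset_vpnor",
    "enable_watchdog",
    "enable_checkstop_monitoring",
    "istep proc_select_boot_master",
    "istep sbe_config_update",
]


def master_start_op(memory_preserving: bool) -> str:
    """Pick the chip-op issued on the master processor."""
    return "continue_mpipl" if memory_preserving else "start_cbs"


def warm_boot_plan() -> List[str]:
    """Ordered plan for the memory preserving (warm) boot."""
    return WARM_BOOT_PREFIX + [master_start_op(memory_preserving=True)]
```

Here `warm_boot_plan()` ends with `continue_mpipl`, while
`master_start_op(False)` returns `start_cbs`, the chip-op that the memory
preserving flow replaces.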

#### 7 - Hostboot booting
Once the SBE is started, it starts hostboot. Hostboot copies the architected
states to the right location and moves the memory contents to create the dump.

#### 8 - Hypervisor formats dump and sends notification to BMC
Once the hypervisor is started, it formats the dump and sends a notification
with the dump size to the BMC through PLDM. PLDM calls the dump manager
interface to report the new dump. The dump manager creates a D-Bus object for
the new dump, with the status 'not offloaded' and the dump size. BMCWeb catches
the object creation signal and notifies the HMC (Hardware Management Console).
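
The notification path can be pictured with a small in-memory model. This is a
sketch under assumptions: `DumpEntry`, `DumpManager`, `on_created`, and
`notify` are hypothetical names, and a plain callback list stands in for the
D-Bus object creation signal that BMCWeb listens for.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DumpEntry:
    """Stand-in for the D-Bus object created for a new host dump."""
    dump_id: int
    size: int
    offloaded: bool = False  # a new dump starts as "not offloaded"


class DumpManager:
    """Keeps dump entries and tells listeners when one is created."""

    def __init__(self) -> None:
        self.entries: Dict[int, DumpEntry] = {}
        self._listeners: List[Callable[[DumpEntry], None]] = []

    def on_created(self, listener: Callable[[DumpEntry], None]) -> None:
        """Register a callback (models a subscriber to the creation signal)."""
        self._listeners.append(listener)

    def notify(self, dump_id: int, size: int) -> DumpEntry:
        """Called on a new-dump notification carrying the dump id and size."""
        entry = DumpEntry(dump_id=dump_id, size=size)
        self.entries[dump_id] = entry
        for listener in self._listeners:
            listener(entry)
        return entry
```

A listener registered with `on_created` plays the role of BMCWeb: it is called
once for each `notify` and could forward the event to the HMC.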

#### 9 - HMC sends dump offload request
Once the HMC is ready to offload, it creates an NBD server and sends a dump
offload request to the BMC. BMCWeb creates an NBD client and an NBD proxy to
offload the dump. The BMC dump manager makes a PLDM call with the dump id
provided by the hypervisor and the NBD device id. PLDM sends the offload
request to the hypervisor with the dump id.

#### 10 - Hypervisor starts dump offload
The hypervisor starts sending the dump packets down through DMA. PLDM reads
the dump and writes it to the NBD client endpoint. The data reaches the NBD
server on the HMC and is written to a dump file.
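
The read-and-forward loop at the heart of this step can be sketched as a
chunked copy between two byte streams. This is only an illustration: in the
example, in-memory streams stand in for the DMA source and the NBD client
endpoint, and `relay_dump` and its `chunk_size` default are hypothetical.

```python
import io
from typing import BinaryIO


def relay_dump(source: BinaryIO, sink: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy the dump from source to sink in fixed-size chunks.

    Returns the number of bytes copied. Assumes sink.write() accepts the
    whole chunk, which holds for buffered streams such as io.BytesIO.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty read means the dump is exhausted
            break
        sink.write(chunk)
        total += len(chunk)
    return total
```

For example, `relay_dump(io.BytesIO(data), io.BytesIO())` copies `data` in
4096-byte pieces, so the whole dump never has to be held in memory at once.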

#### 11 - Hypervisor sends down offload complete message
The hypervisor sends an offload complete message down to the BMC, and the BMC
forwards it to the HMC. The NBD endpoints are then cleared.

#### 12 - HMC verifies dump and sends dump DELETE to BMC
The HMC verifies the dump and sends a dump delete request to the BMC. The BMC
sends the dump delete message to the hypervisor, and the hypervisor deletes the
dump in host memory.
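
Steps 8 through 12 imply a simple lifecycle for each dump. The sketch below
models it as a state machine. Only the 'not offloaded' status is named in this
document; the other state names, and the restriction to a single forward path,
are assumptions made for illustration.

```python
from enum import Enum


class DumpState(Enum):
    NOT_OFFLOADED = "not offloaded"  # step 8: dump created on the host
    OFFLOADING = "offloading"        # steps 9-10: transfer in progress
    OFFLOADED = "offloaded"          # step 11: offload complete
    DELETED = "deleted"              # step 12: removed from host memory


# Allowed forward transitions (assumed; retries and aborts are not modeled).
_ALLOWED = {
    DumpState.NOT_OFFLOADED: {DumpState.OFFLOADING},
    DumpState.OFFLOADING: {DumpState.OFFLOADED},
    DumpState.OFFLOADED: {DumpState.DELETED},
    DumpState.DELETED: set(),
}


def advance(current: DumpState, new: DumpState) -> DumpState:
    """Return the new state, or raise ValueError for an illegal transition."""
    if new not in _ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new
```

With this model, a delete request for a dump that has not been offloaded yet
is rejected, which matches the order in which the HMC verifies and then deletes
the dump.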

### Memory preserve reboot sequence
![Memory preserve reboot sequence](https://user-images.githubusercontent.com/16666879/77681484-64448100-6fbb-11ea-94b4-9f2256241b1c.jpeg)

### Dump offload sequence
![Dump offload sequence](https://user-images.githubusercontent.com/16666879/77681614-9e158780-6fbb-11ea-8fac-fbcffd563bef.jpeg)

## Alternatives Considered
Offloading the dump from the hypervisor directly to an external dump collection
application, instead of offloading through the BMC, was considered. Offloading
through the BMC was selected for the following reasons:

- The BMC provides a common point for offloading all dumps.
- During prototyping, offloading through the BMC was found to give better
  performance.
- Offloading through the BMC has less development impact on the host.

## Impacts
- PLDM on BMC and Host - Extensions to the PLDM implementation to pass the type
  of dump and the notification of a new dump file to the dump manager.
  [PLDM Design](https://github.com/openbmc/docs/blob/7c8847e95203ebfaeddb82a6e1b9888bc6a11311/designs/pldm-stack.md)

- Dump manager on BMC - The BMC dump manager supports dumps stored on the BMC
  and needs to be expanded to support host dumps.

- The external dump offloading application needs to support NBD based offload.

- A new Redfish schema is proposed for dump operations. [Redfish Dump Proposal](https://lists.ozlabs.org/pipermail/openbmc/2019-December/019827.html)

- BMCWeb needs to implement the new Redfish specification for dump.

- Add support to openpower-hw-diags to catch the special attention and initiate
  the memory preserving reboot.

- SBE needs to support a new operation to analyze the attention received
  from the host. The interface update is yet to be published.

## Testing
- Unit test plans
    - Test dump manager interfaces using busctl
    - Test reboot by setting the diag mode target
    - Test the SBE chip-ops using standalone calls
    - Test PLDM by using hypervisor debug commands
    - Test BMCWeb interfaces using curl

- Integration testing
    - User-initiated dump testing, which invokes a memory preserving reboot
      to collect the dump.
    - Initiate a memory preserving reboot by injecting a host error.
    - Offload the dump collected in the host.

- System Dump test plan
    - Automated tests to initiate and offload a dump as part of the test bucket.
    - Both user-initiated and error-injection dumps should be attempted.