# Memory preserving reboot and System Dump extraction flow on POWER Systems

Author: Dhruvaraj S <dhruvaraj@in.ibm.com>

Created: 11/06/2019

## Problem Description

On POWER based servers, a hypervisor firmware manages and allocates resources to
the logical partitions running on the server. If this hypervisor encounters an
error and cannot continue with management operations, the server needs to be
restarted. A typical server reboot erases the contents of the main memory,
including the current running configuration of the logical partitions and the
data required for debugging the fault. Some hypervisors on POWER based systems
do not have access to non-volatile storage to save this content after a failure.
A warm reboot that preserves the main memory is therefore needed on POWER based
servers to create the memory dump required for debugging. This document explains
the high-level flow of the warm reboot and the extraction of the resulting dump
from the hypervisor memory.

## Glossary

- **Boot**: The process of initializing the hardware components in a computer
  system and loading the operating system.

- **Hostboot**: The firmware that runs on the host processors and performs all
  processor, bus, and memory initialization on POWER based servers.
  [read more](https://github.com/open-power/docs/blob/master/hostboot/HostBoot_PG.md)

- **Self Boot Engine (SBE)**: A microcontroller built into the host processors
  of POWER systems to assist in initializing the processor during boot. It also
  acts as an entry point for several hardware access operations to the
  processor. [read more](https://sched.co/SPZP)

- **Master Processor**: The processor which gets initialized first to execute
  the boot firmware.

- **POWER Hardware Abstraction Layer (PHAL)**: A software component on the BMC
  providing access to the POWER hardware.
- **Hypervisor**: A hypervisor (or virtual machine monitor, VMM) is computer
  software, firmware, or hardware that creates and runs virtual machines.
  [read more](https://en.wikipedia.org/wiki/Hypervisor)

- **System Dump**: A dump of the main memory and hardware states for debugging
  faults in the hypervisor.

- **Memory Preserving Reboot (MPR)**: A method of rebooting while preserving the
  contents of the volatile memory.

- **Terminate Immediate (TI)**: A condition in which the hypervisor has
  encountered a fatal error and cannot continue with normal operations.

- **Attention**: A signal generated by the hardware or the firmware for a
  specific event.

- **Redfish**: The Redfish standard is a suite of specifications that deliver an
  industry-standard protocol providing a RESTful interface for the management of
  servers, storage, networking, and converged infrastructure.
  [Read More](<https://en.wikipedia.org/wiki/Redfish_(specification)>)

- **OCC**: An On-Chip Controller (OCC) is a co-processor that is embedded
  directly on the die of POWER processors. The OCC can be used to control the
  processor frequency, power consumption, and temperature to maximize
  performance and minimize energy usage.
  [Read More](https://openpowerfoundation.org/on-chip-controller-occ/)

- **Checkstop**: A severe error inside a processor core that causes the
  processor core to stop all processing activities.

- **PNOR**: The host NOR flash where the firmware is stored.

## Background and References

When a POWER based server encounters a fault and needs a restart, it alerts the
BMC to initiate a memory preserving reboot. The BMC starts the reboot by
informing the SBE on each of the processors. Each SBE stops the running cores,
collects the hardware states in a specified format, and stores them into the
host memory. Once the data is collected, the SBE returns control to the BMC.
The BMC then initiates a memory preserving reboot. Once the system finishes
booting, the hypervisor collects the hardware data and memory contents to create
a dump file in the host memory.

## Requirements

### Primary Requirements

- The system dump should be collected irrespective of the availability of an
  external entity to offload it at the time of a failure.

- It should provide a mechanism for the user to request a system dump.

- The server should boot back to runtime.

- The hypervisor should send a special attention to the BMC to notify it about a
  severe fault.

- The BMC should receive the special TI attention from the hypervisor.

- The BMC should change the host state to 'DiagnosticMode'.

- The BMC should inform the SBE to start the memory preserving reboot and
  collect the hardware data.

- The error log associated with the dump needs to be part of the dump package.

- A dump summary should be created with the size and other details of the dump.

- Once the dump is generated, the hypervisor should notify the BMC.

- The hypervisor should offload the dump to the BMC for transfer to an external
  client.

- Provide Redfish interfaces to manage the dump.

- A tool to collect the dump from the server.

- A method to parse the contents of the dump.

## Proposed Design

### The flow

The flow of the memory preserving reboot and system dump offloading:

#### 1 - Server fault and notification to BMC

When there is a fault, the hypervisor generates an attention. The attention
listener on the BMC detects the attention. In the case of OpenPOWER based Linux
systems, an additional s0 interrupt is sent to the SBE to stop the cores
immediately.

#### 2 - Analyze the error data

The attention listener on the BMC calls a chip-op to analyze the reason for the
attention.
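The attention handling in steps 1 and 2, and its hand-off into the reboot, can
be sketched as a small dispatch routine. This is a minimal illustration, not the
actual BMC implementation: the enum values and the callback names
(`analyze_chip_op`, `initiate_mpr`, `log_error`) are hypothetical stand-ins for
the real chip-op and reboot-target interactions.

```python
from enum import Enum, auto


class AttentionType(Enum):
    """Illustrative classification of attention causes (names are assumed)."""
    TERMINATE_IMMEDIATE = auto()  # hypervisor hit a fatal error (TI)
    CHECKSTOP = auto()            # severe hardware error
    UNKNOWN = auto()


def handle_attention(analyze_chip_op, initiate_mpr, log_error):
    """Sketch of the attention listener's dispatch logic.

    analyze_chip_op: returns an AttentionType; stands in for the chip-op
        that analyzes the reason for the attention (step 2).
    initiate_mpr: stands in for setting the diagnostic reboot target.
    log_error: records non-TI events for later analysis.
    """
    reason = analyze_chip_op()
    if reason is AttentionType.TERMINATE_IMMEDIATE:
        # A TI attention means the hypervisor cannot continue: start the
        # memory preserving reboot so the dump can be collected.
        initiate_mpr()
        return "mpr-initiated"
    log_error(reason)
    return "logged"
```

The callbacks keep the sketch testable without any hardware access: a unit test
can substitute plain lambdas for the chip-op and the reboot target.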
#### 3 - Initiate System Dump

The attention handler on the BMC sets the diagnostic reboot target to initiate a
memory preserving reboot.

#### 4 - Initiate the memory preserve transition

The following steps are executed as part of the reboot target:

- Set the system state to DiagnosticMode
- Stop the OCC
- Disable checkstop monitoring
- Issue the enter_mpipl chip-op to each SBE

#### 5 - SBE collects the hardware data

Each SBE collects the architected states and stores them into a pre-defined
location.

#### 6 - BMC starts the warm boot

Once the SBEs finish the hardware data collection, the BMC does the following to
boot the system while preserving the memory:

- Reset VPNOR
- Enable the watchdog
- Enable checkstop monitoring
- Run istep proc_select_boot_master
- Run istep sbe_config_update
- Issue the continue_mpipl chip-op instead of start_cbs on the master processor

#### 7 - Hostboot booting

Once the SBE is started, it starts hostboot. Hostboot copies the architected
states to the right location and moves the memory contents to create the dump.

#### 8 - Hypervisor formats the dump and sends a notification to BMC

Once the hypervisor is started, it formats the dump and sends a notification
with the dump size to the BMC through PLDM. PLDM calls the dump manager
interface to notify it of the new dump. The dump manager creates a D-Bus object
for the new dump, with the status set to not offloaded and the dump size. BMCWeb
catches the object creation signal and notifies the HMC.

#### 9 - HMC sends a request to offload the dump

Once the HMC is ready to offload, it creates an NBD server and sends a dump
offload request to the BMC. BMCWeb creates an NBD client and an NBD proxy to
offload the dump. The BMC dump manager makes a PLDM call with the dump id
provided by the hypervisor and the NBD device id. PLDM sends the offload request
to the hypervisor with the dump id.
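The dump bookkeeping described in steps 8 and 9 can be modeled as below. This is
a hedged sketch, not the actual dump manager D-Bus API: `DumpEntry`,
`DumpManager`, `notify`, and `offload` are illustrative names, and the PLDM call
is abstracted as a plain callback.

```python
class DumpEntry:
    """Models the per-dump D-Bus object created in step 8.

    Property names are illustrative, not the real D-Bus interface.
    """

    def __init__(self, dump_id, size):
        self.dump_id = dump_id
        self.size = size
        self.offloaded = False  # new dumps start as "not offloaded"


class DumpManager:
    """Sketch of the dump manager's notify/offload bookkeeping."""

    def __init__(self, send_pldm_offload):
        # send_pldm_offload(dump_id, nbd_device_id) stands in for the
        # PLDM call that forwards the offload request to the hypervisor.
        self.entries = {}
        self._send_pldm_offload = send_pldm_offload

    def notify(self, dump_id, size):
        # Step 8: PLDM reports a new dump; record it with its size.
        self.entries[dump_id] = DumpEntry(dump_id, size)
        return self.entries[dump_id]

    def offload(self, dump_id, nbd_device_id):
        # Step 9: pair the hypervisor's dump id with the NBD device id
        # and forward the offload request over PLDM.
        entry = self.entries[dump_id]
        self._send_pldm_offload(dump_id, nbd_device_id)
        entry.offloaded = True
        return entry
```

Keeping the PLDM transport behind a callback mirrors the unit-test plan later in
this document: the interfaces can be exercised without a live host.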
#### 10 - Hypervisor starts the dump offload

The hypervisor starts sending down the dump packets through DMA. PLDM reads the
dump and writes it to the NBD client endpoint. The data reaches the NBD server
on the HMC and gets written to a dump file.

#### 11 - Hypervisor sends down the offload complete message

The hypervisor sends down the offload complete message to the BMC, and the BMC
forwards it to the HMC. The NBD endpoints are cleared.

#### 12 - HMC verifies the dump and sends a dump DELETE to BMC

The HMC verifies the dump and sends a dump delete request to the BMC. The BMC
sends the dump delete message to the hypervisor, and the hypervisor deletes the
dump in the host memory.

### Memory preserve reboot sequence

### Dump offload sequence

## Alternatives Considered

Offloading the dump from the hypervisor to an external dump collection
application instead of offloading through the BMC. Offloading through the BMC
was selected for the following reasons:

- The BMC provides a common point for offloading all dumps.
- During prototyping, it was found that offloading through the BMC gave better
  performance.
- Offloading through the BMC has less development impact on the host.

## Impacts

- PLDM on BMC and Host - Extensions to the PLDM implementation to pass the type
  of dump and the notification of a new dump file to the dump manager.
  [PLDM Design](https://github.com/openbmc/docs/blob/7c8847e95203ebfaeddb82a6e1b9888bc6a11311/designs/pldm-stack.md)

- Dump manager on BMC - The BMC dump manager supports dumps stored on the BMC,
  and that needs to be expanded to support host dumps.

- The external dump offloading application needs to support NBD based offload.

- Proposing a new Redfish schema for dump operations.
  [Redfish Dump Proposal](https://lists.ozlabs.org/pipermail/openbmc/2019-December/019827.html)

- BMCWeb needs to implement the new Redfish specification for dumps.
- Add support to openpower-hw-diags to catch the special attention and initiate
  the memory preserving reboot.

- The SBE needs to support a new operation to analyze the attention received
  from the host. The interface update is yet to be published.

## Testing

- Unit test plans:

  - Test the dump manager interfaces using busctl
  - Test the reboot by setting the diag mode target
  - Test the SBE chip-ops using standalone calls
  - Test PLDM by using hypervisor debug commands
  - Test the BMCWeb interfaces using curl

- Integration testing:

  - User-initiated dump testing, which invokes a memory preserving reboot to
    collect the dump.
  - Initiate the memory preserving reboot by injecting a host error.
  - Offload the dump collected in the host.

- System dump test plan:
  - Automated tests to initiate and offload the dump as part of the test bucket.
  - Both user-initiated and error injection should be attempted.
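One way to make the offload path unit-testable, as the plans above suggest, is
to treat both the dump source and the NBD client endpoint as file-like streams
so that in-memory buffers can stand in for the real device nodes. The sketch
below models the step-10 copy loop under that assumption; the chunk size and
function name are illustrative, and the real path moves DMA packets to a
/dev/nbdX device rather than Python streams.

```python
import io

CHUNK = 4096  # illustrative packet size, not the actual DMA transfer size


def offload_dump(dump_stream, nbd_endpoint, expected_size):
    """Model of the step-10 data path: read dump packets and write them
    to the NBD client endpoint until the stream is exhausted.

    Returns True only if the full expected dump size was transferred,
    mirroring the check a real implementation would want before the
    'offload complete' message in step 11.
    """
    written = 0
    while True:
        packet = dump_stream.read(CHUNK)
        if not packet:
            break
        nbd_endpoint.write(packet)
        written += len(packet)
    return written == expected_size


# Usage: exercise the loop with in-memory buffers instead of devices.
payload = b"\xab" * 10000
ok = offload_dump(io.BytesIO(payload), io.BytesIO(), len(payload))
# ok is True: all 10000 bytes were copied in 4096-byte chunks.
```

Because the loop only depends on `read`/`write`, the same code can be driven by
busctl- or curl-style integration tests at the endpoints without modification.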