# Memory preserving reboot and System Dump extraction flow on POWER Systems

Author: Dhruvaraj S <dhruvaraj@in.ibm.com>

Primary Assignee: Dhruvaraj S

Created: 11/06/2019

## Problem Description

On POWER based servers, a hypervisor firmware manages and allocates
resources to the logical partitions running on the server. If this hypervisor
encounters an error and cannot continue with management operations, the server
needs to be restarted. A typical server reboot erases the contents of the
main memory, which hold the current running configuration of the logical
partitions and the data required for debugging the fault. Some hypervisors on
POWER based systems do not have access to non-volatile storage to store this
content after a failure. A warm reboot that preserves the main memory is
therefore needed on POWER based servers to create the memory dump required for
debugging. This document explains the high-level flow of the warm reboot and
the extraction of the resulting dump from the hypervisor memory.

## Glossary

- **Boot**: The process of initializing hardware components in a computer system
  and loading the operating system.

- **Hostboot**: The firmware that runs on the host processors and performs all
  processor, bus, and memory initialization on POWER based servers.
  [read more](https://github.com/open-power/docs/blob/master/hostboot/HostBoot_PG.md)

- **Self Boot Engine (SBE)**: A microcontroller built into the host processors
  of POWER systems to assist in initializing the processor during the boot.
  It also acts as an entry point for several hardware access operations to the
  processor. [read more](https://sched.co/SPZP)

- **Master Processor**: The processor which gets initialized first to execute
  boot firmware.

- **POWER Hardware Abstraction Layer (PHAL)**: A software component on the BMC
  providing access to the POWER hardware.


- **Hypervisor**: A hypervisor (or virtual machine monitor, VMM) is computer
  software, firmware, or hardware that creates and runs virtual machines.
  [read more](https://en.wikipedia.org/wiki/Hypervisor)

- **System Dump**: A dump of main memory and hardware states for debugging
  faults in the hypervisor.

- **Memory Preserving Reboot (MPR)**: A method of rebooting that preserves the
  contents of the volatile memory.

- **Terminate Immediate (TI)**: A condition in which the hypervisor has
  encountered a fatal error and cannot continue with normal operations.

- **Attention**: The signal generated by the hardware or the firmware for
  a specific event.

- **Redfish**: The Redfish standard is a suite of specifications that deliver
  an industry-standard protocol providing a RESTful interface for the management
  of servers, storage, networking, and converged infrastructure.
  [Read More](https://en.wikipedia.org/wiki/Redfish_(specification))

- **OCC**: An On-Chip Controller (OCC) is a co-processor that is embedded
  directly on the die of POWER processors. The OCC can be used to control
  the processor frequency, power consumption, and temperature to maximize
  performance and minimize energy usage.
  [Read More](https://openpowerfoundation.org/on-chip-controller-occ/)

- **Checkstop**: A severe error inside a processor core that causes the
  processor core to stop all processing activities.

- **PNOR**: PNOR is a host NOR flash where the firmware is stored.

## Background and References

When a POWER based server encounters a fault and needs a restart,
it alerts the BMC to initiate a memory preserving reboot. The BMC starts the
reboot by informing the SBE on each of the processors. The SBE stops the running
cores, collects the hardware states in a specified format, and stores them in
the host memory. Once the data is collected, the SBE returns control to the BMC.
The BMC then initiates a memory preserving reboot. Once the system finishes
booting, the hypervisor collects the hardware data and memory contents to create
a dump file in the host memory.

## Requirements

### Primary Requirements

- System dump should be collected irrespective of the availability of an
  external entity to offload it at the time of a failure.

- It should provide a mechanism for the user to request a system dump.

- The server should boot back to runtime.

- The hypervisor should send a special attention to BMC to notify about
  a severe fault.

- BMC should receive the special TI attention from the hypervisor.

- BMC should change the host state to 'DiagnosticMode'.

- BMC should inform SBE to start the memory preserving reboot and
  collect the hardware data.

- The error log associated with the dump needs to be part of the dump package.

- A dump summary should be created with the size and other details of the dump.

- Once the dump is generated, the hypervisor should notify BMC.

- Hypervisor should offload the dump to BMC to transfer to an external client.

- Provide Redfish interfaces to manage the dump.

- A tool to collect the dump from the server.

- A method to parse the content of the dump.

## Proposed Design

### The flow

The flow of the memory preserving reboot and system dump offloading is shown
below.

![Memory preserving reboot and dump extraction flow](https://user-images.githubusercontent.com/16666879/77680635-40347000-6fba-11ea-8957-8f7fbc93f57e.jpeg)

#### 1 - Server fault and notification to BMC

When there is a fault, the hypervisor generates an attention. The attention
listener on the BMC detects the attention. In the case of OpenPOWER based Linux
systems, an additional s0 interrupt will be sent to the SBE to stop the cores
immediately.

#### 2 - Analyze the error data

The attention listener on the BMC calls a chip-op to analyze the reason for the
attention.


#### 3 - Initiate System Dump

The attention listener on the BMC sets the Diagnostic target for reboot to
initiate a memory preserving reboot.

#### 4 - Initiate Memory preserve transition

The following steps are executed as part of the reboot target:
- Set the system state to DiagnosticMode
- Stop OCC
- Disable checkstop monitoring
- Issue enter_mpipl chip-op to each SBE

#### 5 - SBE collects the hardware data

Each SBE collects the architected states and stores them in a pre-defined
location.

#### 6 - BMC starts warm boot

Once the SBE finishes the hardware collection, the BMC does the following to
boot the system while preserving the memory:
- Reset VPNOR
- Enable watchdog
- Enable checkstop monitoring
- Run istep proc_select_boot_master
- Run istep sbe_config_update
- Issue continue_mpipl chip-op instead of start_cbs on the
  master processor

#### 7 - Hostboot booting

Once the SBE is started, it starts hostboot. Hostboot copies the architected
states to the right location and moves the memory contents to create the dump.

#### 8 - Hypervisor formats dump and sends notification to BMC

Once the hypervisor is started, it formats the dump and sends a notification
with the dump size to the BMC through PLDM. PLDM calls the dump manager
interface to notify it of the new dump. The dump manager creates a D-Bus object
for the new dump, with the status "not offloaded" and the dump size.
BMCWeb catches the object creation signal and notifies the HMC.

#### 9 - HMC sends request to offload dump

Once the HMC is ready to offload, it creates an NBD server and sends a dump
offload request to the BMC. BMCWeb creates an NBD client and an NBD proxy to
offload the dump. The BMC dump manager makes a PLDM call with the dump id
provided by the hypervisor and the NBD device id. PLDM sends the offload request
to the hypervisor with the dump id.

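
The bookkeeping described in steps 8 and 9 can be pictured with a small
in-memory model. This is only an illustrative sketch: the class names, field
names, and the request layout are assumptions made for this example and do not
reflect the actual dump manager D-Bus interface or the PLDM message format.

```python
from dataclasses import dataclass


@dataclass
class HostDumpEntry:
    """Hypothetical stand-in for the object created for a new dump (step 8)."""
    dump_id: int             # id provided by the hypervisor
    size_bytes: int          # dump size reported with the notification
    offloaded: bool = False  # a new dump starts as "not offloaded"


class DumpManagerModel:
    """Toy model for illustration; not the real BMC dump manager."""

    def __init__(self):
        self._entries = {}

    def notify_new_dump(self, dump_id, size_bytes):
        # Step 8: record the dump announced by the hypervisor.
        entry = HostDumpEntry(dump_id, size_bytes)
        self._entries[dump_id] = entry
        return entry

    def build_offload_request(self, dump_id, nbd_device_id):
        # Step 9: pair the hypervisor's dump id with the NBD device id.
        if dump_id not in self._entries:
            raise KeyError(f"unknown dump id {dump_id}")
        return {"dump_id": dump_id, "nbd_device": nbd_device_id}


manager = DumpManagerModel()
entry = manager.notify_new_dump(dump_id=1, size_bytes=4096)
request = manager.build_offload_request(1, "nbd0")
```

Keeping the dump id as the only key shared between the notification and the
offload request mirrors the text above, where the hypervisor-provided id is
what travels back to the hypervisor with the request.
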

#### 10 - Hypervisor starts dump offload

The hypervisor starts sending down the dump packets through DMA.
PLDM reads the dump and writes it to the NBD client endpoint.
The data reaches the NBD server on the HMC and gets written to a dump file.

#### 11 - Hypervisor sends down offload complete message

The hypervisor sends down an offload complete message to the BMC, and the BMC
sends it to the HMC. The NBD endpoints are cleared.

#### 12 - HMC verifies dump and sends dump delete to BMC

The HMC verifies the dump and sends a dump delete request to the BMC.
The BMC sends the dump delete message to the hypervisor.
The hypervisor deletes the dump in host memory.

### Memory preserve reboot sequence

![Memory preserve reboot sequence](https://user-images.githubusercontent.com/16666879/77681484-64448100-6fbb-11ea-94b4-9f2256241b1c.jpeg)

### Dump offload sequence

![Dump offload sequence](https://user-images.githubusercontent.com/16666879/77681614-9e158780-6fbb-11ea-8fac-fbcffd563bef.jpeg)

## Alternatives Considered

An alternative is to offload the dump from the hypervisor directly to an
external dump collection application instead of offloading through the BMC.
Offloading through the BMC was selected for the following reasons:
- BMC provides a common point for offloading all dumps.
- During prototyping, offloading through the BMC was found to give better
  performance.
- Offloading through the BMC has less development impact on the host.

## Impacts

- PLDM on BMC and Host - Extensions to the PLDM implementation to pass the type
  of dump, and notification of a new dump file to the dump manager.
  [PLDM Design](https://github.com/openbmc/docs/blob/7c8847e95203ebfaeddb82a6e1b9888bc6a11311/designs/pldm-stack.md)

- Dump manager on BMC - The BMC dump manager supports dumps stored on the BMC
  and needs to be expanded to support host dumps.

- The external dump offloading application needs to support NBD based offload.

- A new Redfish schema is proposed for dump operations.
  [Redfish Dump Proposal](https://lists.ozlabs.org/pipermail/openbmc/2019-December/019827.html)

- BMCWeb needs to implement the new Redfish specification for dump.

- Add support to openpower-hw-diags to catch the special attention and initiate
  a memory preserving reboot.

- SBE needs to support a new operation to analyze the attention received
  from the host. The interface update is yet to be published.

## Testing

- Unit test plans
  - Test dump manager interfaces using busctl
  - Test reboot by setting the diag mode target
  - Test the SBE chip-op using standalone calls
  - Test PLDM by using hypervisor debug commands
  - Test BMCWeb interfaces using curl

- Integration testing
  - User-initiated dump testing, which invokes a memory preserving reboot
    to collect the dump.
  - Initiate a memory preserving reboot by injecting a host error.
  - Offload the dump collected in the host.

- System Dump test plan
  - Automated tests to initiate and offload the dump as part of the test bucket.
  - Both user-initiated and error injection should be attempted.
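
As one possible aid for the automated tests, the ordering of steps 4 and 6 of
the flow can be written down as data and checked against a trace of observed
actions. This is a minimal sketch under stated assumptions: `enter_mpipl`,
`continue_mpipl`, `proc_select_boot_master`, and `sbe_config_update` are named
in the flow, while the other step labels and the checker function are
hypothetical names chosen for this example.

```python
# Step 4: memory preserve transition (labels partly hypothetical).
MPR_TRANSITION_STEPS = [
    "set_state_diagnostic_mode",
    "stop_occ",
    "disable_checkstop_monitoring",
    "enter_mpipl",
]

# Step 6: warm boot performed by the BMC (labels partly hypothetical).
WARM_BOOT_STEPS = [
    "reset_vpnor",
    "enable_watchdog",
    "enable_checkstop_monitoring",
    "proc_select_boot_master",
    "sbe_config_update",
    "continue_mpipl",
]

EXPECTED_ORDER = MPR_TRANSITION_STEPS + WARM_BOOT_STEPS


def first_out_of_order(observed):
    """Return the first step that is unknown, repeated, or out of order.

    Returns None when the observed steps follow the expected order.
    Skipped steps are not reported; this only checks relative ordering.
    """
    position = {name: i for i, name in enumerate(EXPECTED_ORDER)}
    last = -1
    for step in observed:
        index = position.get(step)
        if index is None or index <= last:
            return step
        last = index
    return None
```

For example, a trace that issues `continue_mpipl` before `enter_mpipl` would be
reported at `enter_mpipl`, and a trace containing `start_cbs` would be reported
at `start_cbs` because that step is not part of the memory preserving path.
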