1 # Memory preserving reboot and System Dump extraction flow on POWER Systems.
9 On POWER based servers, a hypervisor firmware manages and allocates resources to
10 the logical partitions running on the server. If this hypervisor encounters an
14 required for debugging the fault. Some hypervisors on the POWER based systems
15 don't have access to a non-volatile storage to store this content after a
16 failure. A warm reboot that preserves the main memory is needed on the POWER
18 explains the high-level flow of warm reboot and extraction of the resulting dump
23 - **Boot**: The process of initializing hardware components in a computer system
24 and loading the operating system.
26 - **Hostboot**: The firmware that runs on the host processors and performs all
27 processor, bus, and memory initialization on POWER based servers.
28 [read more](https://github.com/open-power/docs/blob/master/hostboot/HostBoot_PG.md)
30 - **Self Boot Engine (SBE)**: A microcontroller built into the host processors
35 - **Master Processor**: The processor which gets initialized first to execute
38 - **POWER Hardware Abstraction Layer (PHAL)**: A software component on the BMC
41 - **Hypervisor**: A hypervisor (or virtual machine monitor, VMM) is a computer
45 - **System Dump**: A dump of main memory and hardware states for debugging the
48 - **Memory Preserving Reboot (MPR)**: A method of rebooting while preserving the
51 - **Terminate Immediate (TI)**: A condition in which the hypervisor has encountered a
54 - **Attention**: The signal generated by the hardware or the firmware for a
57 - **Redfish**: The Redfish standard is a suite of specifications that deliver an
58 industry-standard protocol providing a RESTful interface for the management of
62 - **OCC**: An On-Chip Controller (OCC) is a co-processor that is embedded
63 directly on the die of POWER processors. The OCC can be used to control the
67 [Read More](https://openpowerfoundation.org/on-chip-controller-occ/)
69 - **Checkstop**: A severe error inside a processor core that causes a processor
72 - **PNOR**: PNOR is a host NOR flash where the firmware is stored.
78 the SBE on each of the processors. SBE stops the running cores and collects the
81 memory preserved reboot. Once the system has finished booting, the hypervisor
89 - System dump should be collected irrespective of the availability of an
92 - It should provide a mechanism for the user to request a system dump.
94 - The server should boot back to runtime
96 - The hypervisor should send a special attention to BMC to notify about a severe
99 - BMC should receive special TI attention from hypervisor
101 - BMC should change the host state to 'DiagnosticMode'.
103 - BMC should inform SBE to start the memory preserving reboot and collect the
106 - Error log associated with dump needs to be part of the dump package
108 - A dump summary should be created with size and other details of the dump
110 - Once the dump is generated, the hypervisor should notify BMC.
112 - Hypervisor should offload the dump to BMC to transfer to an external client.
114 - Provide Redfish interfaces to manage dump
116 - A tool to collect the dump from the server.
118 - A method to parse the content of the dump.
124 The flow of the memory preserving reboot and system dump offloading
125 …t and dump extraction flow](https://user-images.githubusercontent.com/16666879/77680635-40347000-6…
127 #### 1 - Server fault and notification to BMC
130 listener on the BMC detects the attention. In the case of OpenPOWER based Linux
134 #### 2 - Analyze the error data.
136 The attention listener on the BMC calls a chip-op to analyze the reason for the
139 #### 3 - Initiate System Dump
141 Attention on the BMC sets the Diagnostic target for reboot to initiate a memory
144 #### 4 - Initiate Memory preserve transition
146 following steps are executed as part of the reboot target:

- Set the system state to DiagnosticMode
- Stop OCC
- Disable checkstop monitoring
- Issue enter_mpipl chip-op to each SBE
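
The ordering of the transition steps listed above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class `MprTransition`, its method names, and the log strings are hypothetical and are not part of any BMC code base. Only the step order and the `enter_mpipl` chip-op name come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MprTransition:
    """Hypothetical model of the memory preserving transition target."""

    sbe_ids: List[int]                      # one entry per processor's SBE
    host_state: str = "Running"             # assumed initial state name
    log: List[str] = field(default_factory=list)

    def set_diagnostic_mode(self) -> None:
        # Step 1: set the system state to DiagnosticMode.
        self.host_state = "DiagnosticMode"
        self.log.append("set-state:DiagnosticMode")

    def stop_occ(self) -> None:
        # Step 2: stop the On-Chip Controller.
        self.log.append("stop-occ")

    def disable_checkstop_monitoring(self) -> None:
        # Step 3: disable checkstop monitoring.
        self.log.append("disable-checkstop-monitoring")

    def enter_mpipl(self) -> None:
        # Step 4: issue the enter_mpipl chip-op to each SBE.
        for sbe in self.sbe_ids:
            self.log.append(f"enter_mpipl:sbe{sbe}")

    def run(self) -> List[str]:
        # Execute the steps strictly in the documented order.
        for step in (
            self.set_diagnostic_mode,
            self.stop_occ,
            self.disable_checkstop_monitoring,
            self.enter_mpipl,
        ):
            step()
        return self.log


if __name__ == "__main__":
    transition = MprTransition(sbe_ids=[0, 1])
    for entry in transition.run():
        print(entry)
```

Keeping the steps in a single ordered tuple makes the sequence explicit and easy to compare against the list in the text; a real implementation would replace the log entries with the actual service and chip-op calls.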
150 #### 5 - SBE collects the hardware data
152 Each SBE collects the architected states and stores them in a pre-defined
155 #### 6 - BMC starts warm boot
158 system while preserving the memory.

- Reset VPNOR
- Enable watchdog
- Enable checkstop monitoring
- Run istep proc_select_boot_master
- Run istep sbe_config_update
- Issue continue_mpipl chip-op instead of start_cbs on the
163 #### 7 - Hostboot booting
168 #### 8 - Hypervisor formats dump and sends notification to BMC
176 #### 9 - HMC sends request for dump offload
184 #### 10 - Hypervisor starts dump offload
187 and write to the NBD client endpoint. The data reaches the NBD server on the HMC
190 #### 11 - Hypervisor sends down offload complete message
195 #### 12 - HMC verifies dump and sends dump DELETE to BMC.
202 …y preserve reboot sequence](https://user-images.githubusercontent.com/16666879/77681484-64448100-6…
206 ![Dump offload sequence](https://user-images.githubusercontent.com/16666879/77681614-9e158780-6fbb-
212 following reasons. - BMC provides a common point for offloading all dumps -
214 performance. - Offloading through BMC has less development impact on the host.
218 - PLDM on BMC and Host - Extensions to PLDM implementation to pass type of dump,
220 …tps://github.com/openbmc/docs/blob/7c8847e95203ebfaeddb82a6e1b9888bc6a11311/designs/pldm-stack.md])
222 - Dump manager on BMC - BMC dump manager supports dump stored on BMC and that
225 - External dump offloading application needs to support NBD based offload
227 - Proposing a new Redfish schema for dump operations.
228 [Redfish Dump Proposal](https://lists.ozlabs.org/pipermail/openbmc/2019-December/019827.html)
230 - BMCWeb needs to implement the new Redfish specification for dump.
232 - Add support to openpower-hw-diags to catch special attention and initiate
235 - SBE needs to support a new operation to analyze the attention received from
240 - Unit test plans
  - Test dump manager interfaces using busctl
  - Test reboot by setting the diag mode target
  - Test the SBE chip-op using standalone calls
  - Test PLDM by using hypervisor debug commands
  - Test BMCWeb interfaces using
245 - Integration testing by
247 - User-initiated dump testing, which invokes a memory preserving reboot to
249 - Initiate memory preserving reboot by injecting host error
250 - Offload dump collected in host.
252 - System Dump test plan
253 - Automated tests to initiate and offload dump as part of test bucket.
254 - Both user-initiated and error injection should be attempted.