
# Memory preserving reboot and System Dump extraction flow on POWER Systems

Author: Dhruvaraj S <dhruvaraj@in.ibm.com>

Created: 11/06/2019

## Problem Description

On POWER based servers, hypervisor firmware manages and allocates
resources to the logical partitions running on the server. If this hypervisor
encounters an error and cannot continue with management operations, the server
needs to be restarted. A typical server reboot erases the contents of main
memory, including the current running configuration of the logical partitions
and the data required for debugging the fault. Some hypervisors on POWER
based systems don't have access to non-volatile storage in which to store this
content after a failure. A warm reboot that preserves main memory is therefore
needed on POWER based servers to create the memory dump required for
debugging. This document explains the high-level flow of the warm reboot and
the extraction of the resulting dump from hypervisor memory.


## Glossary

- **Boot**: The process of initializing hardware components in a computer system
and loading the operating system.

- **Hostboot**: The firmware that runs on the host processors and performs all
processor, bus, and memory initialization on POWER based servers.
[read more](https://github.com/open-power/docs/blob/master/hostboot/HostBoot_PG.md)

- **Self Boot Engine (SBE)**: A microcontroller built into the host processors
of POWER systems to assist in initializing the processor during the boot.
It also acts as an entry point for several hardware access operations to the
processor. [read more](https://sched.co/SPZP)

- **Master Processor**: The processor which gets initialized first to execute
boot firmware.

- **POWER Hardware Abstraction Layer (PHAL)**: A software component on the BMC
providing access to the POWER hardware.

- **Hypervisor**: A hypervisor (or virtual machine monitor, VMM) is computer
software, firmware, or hardware that creates and runs virtual machines.
[read more](https://en.wikipedia.org/wiki/Hypervisor)

- **System Dump**: A dump of main memory and hardware states for debugging
faults in the hypervisor.

- **Memory Preserving Reboot (MPR)**: A method of rebooting that preserves the
contents of the volatile memory.

- **Terminate Immediate (TI)**: A condition in which the hypervisor has
encountered a fatal error and cannot continue normal operations.

- **Attention**: The signal generated by the hardware or the firmware for
a specific event.

- **Redfish**: The Redfish standard is a suite of specifications that deliver
an industry-standard protocol providing a RESTful interface for the management
of servers, storage, networking, and converged infrastructure.
[Read More](https://en.wikipedia.org/wiki/Redfish_(specification))

- **OCC**: An On-Chip Controller (OCC) is a co-processor that is embedded
directly on the die of POWER processors. The OCC can be used to control
the processor frequency, power consumption, and temperature to maximize
performance and minimize energy usage.
[Read More](https://openpowerfoundation.org/on-chip-controller-occ/)

- **Checkstop**: A severe error inside a processor core that causes the core
to stop all processing activities.

- **PNOR**: PNOR is a host NOR flash where the firmware is stored.

## Background and References
When a POWER based server encounters a fault and needs a restart,
it alerts the BMC to initiate a memory preserving reboot. The BMC starts the
reboot by informing the SBE on each of the processors. Each SBE stops the
running cores, collects the hardware states in a specified format, and stores
them in host memory. Once the data is collected, the SBE returns control to the
BMC. The BMC then initiates a memory preserving reboot. Once the system
finishes booting, the hypervisor collects the hardware data and memory contents
to create a dump file in host memory.

## Requirements

### Primary Requirements

-   A system dump should be collected irrespective of the availability of an
    external entity to offload it at the time of a failure.

-   The system should provide a mechanism for the user to request a system
    dump.

-   The server should boot back to runtime.

-   The hypervisor should send a special attention to the BMC to notify it of
    a severe fault.

-   The BMC should receive the special TI attention from the hypervisor.

-   The BMC should change the host state to 'DiagnosticMode'.

-   The BMC should inform the SBE to start the memory preserving reboot and
    collect the hardware data.

-   The error log associated with the dump needs to be part of the dump
    package.

-   A dump summary should be created with the size and other details of the
    dump.

-   Once the dump is generated, the hypervisor should notify the BMC.

-   The hypervisor should offload the dump to the BMC for transfer to an
    external client.

-   Provide Redfish interfaces to manage the dump.

-   Provide a tool to collect the dump from the server.

-   Provide a method to parse the content of the dump.

## Proposed Design

### The flow
The following diagram shows the flow of the memory preserving reboot and
system dump offload:
![Memory preserving reboot and dump extraction flow](https://user-images.githubusercontent.com/16666879/77680635-40347000-6fba-11ea-8957-8f7fbc93f57e.jpeg)

#### 1 - Server fault and notification to BMC
When there is a fault, the hypervisor generates an attention. The attention
listener on the BMC detects the attention. In the case of OpenPOWER based Linux
systems, an additional s0 interrupt is sent to the SBE to stop the cores
immediately.

#### 2 - Analyze the error data
The attention listener on the BMC calls a chip-op to analyze the reason for the
attention.

#### 3 - Initiate System Dump
The attention listener on the BMC sets the diagnostic target for reboot to
initiate a memory preserving reboot.
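
As a rough illustration of steps 1 to 3, the sketch below maps an analyzed attention to a reboot target. The `AttentionType` enum, the `DIAGNOSTIC_TARGET` name, and the `choose_reboot_target` function are hypothetical names invented for this example and are not part of any existing BMC interface.

```python
from enum import Enum, auto
from typing import Optional


class AttentionType(Enum):
    """Hypothetical attention categories, for illustration only."""

    TERMINATE_IMMEDIATE = auto()  # hypervisor hit a fatal error (TI)
    OTHER = auto()                # any attention that needs no system dump


# Hypothetical name for the reboot target that starts a memory preserving reboot.
DIAGNOSTIC_TARGET = "diagnostic-mode"


def choose_reboot_target(attention: AttentionType) -> Optional[str]:
    """Return the reboot target to set for an analyzed attention, or None."""
    if attention is AttentionType.TERMINATE_IMMEDIATE:
        return DIAGNOSTIC_TARGET
    return None
```

Only the TI case is modeled here, because that is the case this design describes; how other attention types are handled is outside the scope of the sketch.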

#### 4 - Initiate Memory preserve transition
The following steps are executed as part of the reboot target:

- Set the system state to DiagnosticMode
- Stop OCC
- Disable checkstop monitoring
- Issue enter_mpipl chip-op to each SBE
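
The ordering of this transition can be sketched as a simple step runner. This is a minimal sketch, not the actual reboot target implementation: `run_steps` and `memory_preserve_steps` are hypothetical helpers, the actions only record their names, and aborting on the first failure is an assumption rather than something this design specifies.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus a callable that performs it.
Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order and return the names of the completed steps.

    An exception raised by a step propagates to the caller, so later
    steps never run (assumed behavior for this sketch).
    """
    completed = []
    for name, action in steps:
        action()
        completed.append(name)
    return completed


def memory_preserve_steps(
    processors: List[int], record: Callable[[str], None]
) -> List[Step]:
    """Build the step list for the memory preserve transition.

    `record` stands in for the real work; here each step only reports its name.
    """
    names = [
        "set system state to DiagnosticMode",
        "stop OCC",
        "disable checkstop monitoring",
    ]
    # One enter_mpipl chip-op per SBE, that is, per processor.
    names += [f"enter_mpipl on SBE {proc}" for proc in processors]
    # Bind each name at definition time so every action records its own step.
    return [(name, lambda n=name: record(n)) for name in names]
```

For a two-processor system, `run_steps(memory_preserve_steps([0, 1], log.append))` runs the three preparation steps and then one `enter_mpipl` step per SBE, in that order.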

#### 5 - SBE collects the hardware data
Each SBE collects the architected states and stores them in a predefined
location.

#### 6 - BMC starts warm boot
Once the SBE finishes the hardware data collection, the BMC does the following
to boot the system while preserving memory:

- Reset VPNOR
- Enable watchdog
- Enable checkstop monitoring
- Run istep proc_select_boot_master
- Run istep sbe_config_update
- Issue continue_mpipl chip-op instead of start_cbs on the master processor
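
The key difference in this step is the final chip-op. A minimal sketch of that choice is shown below; `start_chip_op` and `warm_boot_plan` are hypothetical helper names, and the plan is represented as plain strings rather than real istep or chip-op calls.

```python
from typing import List


def start_chip_op(memory_preserving: bool) -> str:
    """Pick the chip-op that starts the master processor.

    A memory preserving reboot uses continue_mpipl in place of start_cbs.
    """
    return "continue_mpipl" if memory_preserving else "start_cbs"


def warm_boot_plan(master_proc: int) -> List[str]:
    """Return the ordered warm boot actions for a memory preserving reboot."""
    return [
        "reset VPNOR",
        "enable watchdog",
        "enable checkstop monitoring",
        "istep proc_select_boot_master",
        "istep sbe_config_update",
        f"{start_chip_op(True)} on processor {master_proc}",
    ]
```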

#### 7 - Hostboot booting
Once the SBE is started, it starts hostboot. Hostboot copies the architected
states to the right location and moves the memory contents to create the dump.

#### 8 - Hypervisor formats dump and sends notification to BMC
Once the hypervisor is started, it formats the dump and sends a notification,
including the dump size, to the BMC through PLDM. PLDM calls the dump manager
interface to report the new dump. The dump manager creates a D-Bus object for
the new dump, with the status "not offloaded" and the dump size.
BMCWeb catches the object creation signal and notifies the Hardware Management
Console (HMC).
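
The bookkeeping in this step can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `DumpEntry` and `DumpManager` are hypothetical Python classes standing in for the D-Bus object and the dump manager, and the listener callback stands in for the object creation signal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DumpEntry:
    dump_id: int             # ID provided by the hypervisor
    size: int                # dump size in bytes, as reported by the hypervisor
    offloaded: bool = False  # a new dump starts as "not offloaded"


class DumpManager:
    """Toy stand-in for the BMC dump manager (not the real interface)."""

    def __init__(self) -> None:
        self.entries: Dict[int, DumpEntry] = {}
        self._listeners: List[Callable[[DumpEntry], None]] = []

    def subscribe(self, listener: Callable[[DumpEntry], None]) -> None:
        """Register a callback that fires when a new entry is created."""
        self._listeners.append(listener)

    def notify_new_dump(self, dump_id: int, size: int) -> DumpEntry:
        """Create an entry for a new host dump and notify all listeners."""
        entry = DumpEntry(dump_id=dump_id, size=size)
        self.entries[dump_id] = entry
        for listener in self._listeners:
            listener(entry)
        return entry
```

In this model, the component that plays the role of BMCWeb would call `subscribe` once and then forward each new entry to the HMC.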

#### 9 - HMC sends dump offload request
Once the HMC is ready to offload, it creates an NBD server and sends a dump
offload request to the BMC. BMCWeb creates an NBD client and an NBD proxy to
offload the dump. The BMC dump manager makes a PLDM call with the dump ID
provided by the hypervisor and the NBD device ID. PLDM sends the offload
request to the hypervisor with the dump ID.

#### 10 - Hypervisor starts dump offload
The hypervisor starts sending the dump packets down through DMA. PLDM reads
the dump data and writes it to the NBD client endpoint. The data reaches the
NBD server on the HMC and is written to a dump file.
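
The read-and-forward loop in this step can be sketched as a chunked copy between two file-like objects. This is a simplified illustration, not the PLDM implementation: the `offload` function and the `CHUNK_SIZE` value are assumptions, and in-memory buffers stand in for the DMA source and the NBD client endpoint.

```python
import io
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # assumed chunk size, chosen only for this example


def offload(source: BinaryIO, sink: BinaryIO, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy the dump from source to sink in chunks; return bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory stand-ins for the dump source and the NBD endpoint.
    dump = io.BytesIO(b"\x00" * 150_000)
    endpoint = io.BytesIO()
    assert offload(dump, endpoint) == 150_000
```

Copying in fixed-size chunks keeps memory use bounded regardless of the dump size, which matters because a system dump can be far larger than the memory available on the BMC.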

#### 11 - Hypervisor sends offload complete message
The hypervisor sends an offload complete message to the BMC, and the BMC
forwards it to the HMC. The NBD endpoints are then cleared.

#### 12 - HMC verifies dump and sends dump delete to BMC
The HMC verifies the dump and sends a dump delete request to the BMC. The BMC
sends the dump delete message to the hypervisor, and the hypervisor deletes the
dump from host memory.
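
Steps 8 to 12 together describe a dump lifecycle, which can be summarized as a small state machine. The sketch below models only the path described in this flow; the `DumpState` names and the `advance` function are hypothetical and do not correspond to an existing interface.

```python
from enum import Enum


class DumpState(Enum):
    CREATED = "created"        # hypervisor notified the BMC; not yet offloaded
    OFFLOADING = "offloading"  # data is flowing to the HMC over NBD
    OFFLOADED = "offloaded"    # hypervisor reported offload complete
    DELETED = "deleted"        # HMC verified the dump and requested deletion


# Allowed transitions for the flow described in steps 8 to 12.
_ALLOWED = {
    DumpState.CREATED: {DumpState.OFFLOADING},
    DumpState.OFFLOADING: {DumpState.OFFLOADED},
    DumpState.OFFLOADED: {DumpState.DELETED},
    DumpState.DELETED: set(),
}


def advance(current: DumpState, new: DumpState) -> DumpState:
    """Return the new state, or raise ValueError for an invalid transition."""
    if new not in _ALLOWED[current]:
        raise ValueError(f"invalid transition: {current.value} -> {new.value}")
    return new
```

Error paths, such as a failed offload or a delete request before offload, are deliberately left out because this design does not describe them.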

### Memory preserving reboot sequence
![Memory preserve reboot sequence](https://user-images.githubusercontent.com/16666879/77681484-64448100-6fbb-11ea-94b4-9f2256241b1c.jpeg)

### Dump offload sequence
![Dump offload sequence](https://user-images.githubusercontent.com/16666879/77681614-9e158780-6fbb-11ea-8fac-fbcffd563bef.jpeg)

## Alternatives Considered
One alternative is to offload the dump from the hypervisor directly to an
external dump collection application instead of offloading through the BMC.
Offloading through the BMC was selected for the following reasons:

- The BMC provides a common point for offloading all dumps.
- During prototyping, offloading through the BMC was found to give better
  performance.
- Offloading through the BMC has less development impact on the host.

## Impacts
- PLDM on BMC and Host - Extensions to the PLDM implementation to pass the type
  of dump and to notify the dump manager of a new dump file. [PLDM Design](https://github.com/openbmc/docs/blob/7c8847e95203ebfaeddb82a6e1b9888bc6a11311/designs/pldm-stack.md)

- Dump manager on BMC - The BMC dump manager currently supports dumps stored on
  the BMC and needs to be expanded to support host dumps.

- The external dump offloading application needs to support NBD based offload.

- A new Redfish schema is proposed for dump operations. [Redfish Dump Proposal](https://lists.ozlabs.org/pipermail/openbmc/2019-December/019827.html)

- BMCWeb needs to implement the new Redfish specification for dump.

- Add support to openpower-hw-diags to catch the special attention and initiate
  a memory preserving reboot.

- The SBE needs to support a new operation to analyze the attention received
  from the host. The interface update is yet to be published.

## Testing
- Unit test plans
    - Test dump manager interfaces using busctl
    - Test reboot by setting the diag mode target
    - Test the SBE chip-op using standalone calls
    - Test PLDM by using hypervisor debug commands
    - Test BMCWeb interfaces using curl

- Integration testing
    - User-initiated dump testing, which invokes a memory preserving reboot
      to collect the dump.
    - Initiate a memory preserving reboot by injecting a host error.
    - Offload the dump collected in the host.

- System Dump test plan
    - Automated tests to initiate and offload dump as part of test bucket.
    - Both user-initiated and error injection should be attempted.