# Dump Manager Design

Author:
  Dhruvaraj Subhashchandran <dhruvaraj@in.ibm.com>

Primary assignee:
  Dhruvaraj Subhashchandran <dhruvaraj@in.ibm.com>

Other contributors:

Created: 12/12/2019

## Problem Description
During a crash or a host failure, an event monitor mechanism generates an error
log, but the size of the error log is limited to a few kilobytes, so all the
data from the crash or failure may not fit into an error log. The additional
data required for debugging needs to be collected as a dump.
The existing OpenBMC dump interfaces support only dumps generated on the BMC,
and the dump manager doesn't support download operations.

## Glossary

- **System Dump**: A dump of the Host's main memory and processor registers.
  [Read More](https://en.wikipedia.org/wiki/Core_dump)
- **Memory Preserving Reboot (MPR)**: A method of rebooting that preserves the
  contents of the volatile memory.
- **PLDM**: An interface and data model to access low-level platform inventory,
  monitoring, control, event, and data/parameters transfer functions.
  [Read More](https://github.com/openbmc/docs/blob/master/designs/pldm-stack.md)
- **Machine Check Exception**: A severe error inside a processor core that
  causes the processor core to stop all processing activities.
- **BMCWeb**: An embedded webserver for OpenBMC. [More Info](https://github.com/openbmc/bmcweb/blob/master/README.md)

## Background and References
Various types of dumps are created based on the type and source of failure.
The dump manager, which orchestrates the collection and offload, needs to
provide methods to create a dump, store its details, and offload it.
Additionally, some sources allow the dump to be extracted manually without a
failure, to understand the current state or analyze a suspected problem.

## Requirements

![Dump use cases - Users are examples, not a mandatory part of implementation](https://user-images.githubusercontent.com/16666879/70888651-d8f44080-2006-11ea-8596-ed4c321cfaa6.png)

#### Dump manager needs to provide interfaces for
- Create a dump: Initiate the creation of the dump, based on an error condition
  or a user request.
- List the dumps: List all dumps present in the BMC.
- Get a dump: Offload the dump to an external entity.
- Notify: Notify the dump manager that a new dump is created.
- Delete the dump.
- Mark a dump as offloaded to an external entity.
- Set dump policies, such as disabling a type of dump or the dump overwrite
  policy.

## Proposed Design
There are various types of dumps; interfaces are standard for most of the dumps,
but huge dumps that cannot be stored on the BMC need additional support.
This document explains the design of the different types of dumps. Dumps are
classified based on where they are collected and stored, and how they are
extracted. The two major types are:

- Collected by BMC and stored on BMC.
- Collected and stored on an attached entity but offloaded through BMC.

This proposal focuses on re-using the existing [phosphor-debug-collector](https://github.com/openbmc/phosphor-debug-collector), which
collects the dumps from the BMC.

![phosphor-debug-collector](https://user-images.githubusercontent.com/16666879/72070844-7b56c980-3310-11ea-8d26-07d33b84b980.jpeg)

#### Brief design points of existing phosphor-debug-collector
- A create interface which assumes the type is BMC dump and returns an ID to the
  caller for the user-initiated dumps.
- An external dump request is treated as a user-initiated BMC dump and
  initiates BMC dump collection with a tool named dreport, with the type set to
  manual dump.
- dreport creates the dump file in the dump path provided by the dump manager
  code.
- A watch process is forked by the dump manager to catch the file-close signal
  on that path, at which point the dump collection is assumed to be complete.
- The watch process calls an internal D-Bus interface to create the dump entry
  with size and timestamp.
- The path of the dump is based on the predefined base path and the ID of the
  dump.
- When an offload request arrives, the file is downloaded from the dump base
  path + ID; the entry is not updated to indicate whether the dump has been
  offloaded.
- A dump is deleted by deleting its entry; the internal file is deleted as
  well.
- There are also system-generated dumps, triggered by an error log or a core
  dump, which work similarly to user-initiated dumps with the following
  differences:
  - No external create D-Bus interface is needed; instead, a monitor
    application watches for specific conditions, such as the existence of a
    core file or a signal for the creation of a new error log of a few selected
    types.
  - Once the event occurs, the monitor calls an internal D-Bus interface of the
    dump manager to create the dump, passing the type of dump.
  - The dump manager calls dreport with the dump type received from the monitor
    and writes data to a path based on the dump ID.

#### Updates proposed to the existing phosphor-debug-collector.
- The external D-Bus interface needs to specify the type of the dump, since a
  user can request multiple types of dumps.
- Create will return an ID that can be mapped to the actual dump once it is
  created.
- A Notify interface is provided for notifying the dump manager of a dump
  created outside the BMC but offloaded through the BMC.
- The InitiateOffload function will be implemented to download the dump.
- Status of the dump, whether offloaded or not, will be added to the dump entry.

### Dump manager interfaces.
- The dump manager D-Bus object provides interfaces for creating and managing
  dumps.

- Interfaces
  - **Create**: The interface to create a dump, called by clients to initiate a
    user-initiated dump.
    - Type: Type of the dump to be created.

  - **Notify**: Notify the dump manager that a new dump is created.
    - ID: ID of the dump; if not 0, this is the external ID of the dump.
    - Type: Type of dump that was created.
    - Size: Size of the dump.

### Dump entry interfaces
- **InitiateOffload**: Initiate the offload of the dump.
  - OffloadUri: The URI where the dump should be offloaded.

#### The properties common to all dumps
There will be a base dump entry with properties common to all types of dumps:
- ID: ID of the dump
- Timestamp: Dump creation timestamp
- Size: Total size of the dump
- OffloadComplete: Set to true when offload is completed
- OffloadURI: The URI for offloading the dump, set while initiating the offload.

Specific types need to inherit from this common dump entry class
and add specific properties.

#### Additional properties based on dump types

##### BMC Dump
- No additional properties

##### System Dump
- External Source ID: ID provided by the Host; this ID will be used for all
  communication with the source of the dump, in this case the Host.

### Flow of dumps collected and stored in the Host
PLDM is provided as an example dump transport and notification mechanism
between Host and BMC.

- Create: Initiate methods to create the dump in the Host.
- The Host generates the dump.
- The Host notifies the BMC of the creation of the dump through PLDM.
- PLDM calls Notify to create the dump entry.
- InitiateOffload: The dump manager requests the Host to start the offload.
- The Host sends the dump through PLDM, and PLDM on BMC sends it out.

## Alternatives Considered
- Offloading Host dumps through the Host instead of the BMC was considered, but
  the BMC option was chosen for the following reasons:
  - The BMC is considered the "management path" of most servers and often
    the Host is not connected to the desired network for the offload
    location.
  - The BMC provides one common point for all dumps generated in the system
    for an external management appliance.

## Impacts
- The existing BMC dump interface needs to be re-used. The current interface
  does not accept a dump type, so a new interface to create a dump with a type
  will be provided, for BMC dumps as well, without changing the existing
  interface.
- Modifying the BMC dump infrastructure to support additional dumps.
- openpower-proc-control will be updated to call memory preserving chip-ops and
  to handle memory preserving reboot on POWER platforms.
- An additional system state to indicate that the system is collecting debug
  data while performing a memory preserving reboot.

## Testing
- Unit tests to make sure the dump manager interfaces are working.
- The following integration tests will be executed to make sure the dump manager
  is working as expected.
  - Test creating host dumps and offloading them.
  - Test deleting host dumps.
  - Create/List/Offload/Delete BMC dumps to make sure existing
    dump manager functions are not broken.
- Automated tests for dump Create/List/Offload/Delete to avoid regression.
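
As an illustration of the unit-test bullet above, the following is a minimal
sketch of an in-memory model of the Create, Notify, InitiateOffload, and Delete
operations and the common dump entry properties. It is not the
phosphor-debug-collector implementation and does not use D-Bus. The class and
method names, the `"bmc"` and `"system"` type strings, and the example URI are
all hypothetical. The sketch also assumes that an offload completes immediately
and that Create is modelled only for BMC dumps, since this document does not
spell out how a Create ID is mapped to a Host dump.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DumpEntry:
    """Base entry with the properties common to all dump types."""
    id: int
    size: int = 0
    timestamp: float = field(default_factory=time.time)
    offload_complete: bool = False
    offload_uri: str = ""

    def initiate_offload(self, offload_uri: str) -> None:
        # Record the target URI. Assumption: the transfer succeeds at once,
        # so the entry is marked as offloaded immediately.
        self.offload_uri = offload_uri
        self.offload_complete = True


@dataclass
class SystemDumpEntry(DumpEntry):
    """System dump: adds the external source ID provided by the Host."""
    source_id: int = 0


class DumpManager:
    """Hypothetical in-memory stand-in for the dump manager object."""

    def __init__(self) -> None:
        self._entries: Dict[int, DumpEntry] = {}
        self._next_id = 1

    def _allocate_id(self) -> int:
        dump_id = self._next_id
        self._next_id += 1
        return dump_id

    def create(self, dump_type: str) -> int:
        # Only BMC dumps are modelled; collection is treated as instantaneous.
        if dump_type != "bmc":
            raise ValueError(f"unsupported dump type in this sketch: {dump_type}")
        dump_id = self._allocate_id()
        self._entries[dump_id] = DumpEntry(id=dump_id)
        return dump_id

    def notify(self, source_id: int, dump_type: str, size: int) -> int:
        # A dump was created outside the BMC; create the matching entry.
        dump_id = self._allocate_id()
        if dump_type == "system":
            entry = SystemDumpEntry(id=dump_id, size=size, source_id=source_id)
        else:
            entry = DumpEntry(id=dump_id, size=size)
        self._entries[dump_id] = entry
        return dump_id

    def entries(self) -> Dict[int, DumpEntry]:
        # Return a copy so callers cannot modify the internal table.
        return dict(self._entries)

    def delete(self, dump_id: int) -> None:
        del self._entries[dump_id]


if __name__ == "__main__":
    manager = DumpManager()
    host_dump = manager.notify(source_id=42, dump_type="system", size=1024)
    manager.entries()[host_dump].initiate_offload("https://example.com/dumps/1")
    print(manager.entries()[host_dump])
```

A unit test against such a model would check that Notify creates an entry
carrying the external source ID and size, that OffloadComplete is false until
InitiateOffload is called, and that Delete removes the entry from the list.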