# Dump Manager Design

Author: Dhruvaraj Subhashchandran <dhruvaraj@in.ibm.com>

Other contributors:

Created: 12/12/2019

## Problem Description

During a crash or a host failure, an event monitor mechanism generates an error
log, but the size of the error log is limited to a few kilobytes, so all the
data from the crash or failure may not fit into an error log. The additional
data required for debugging needs to be collected as a dump. The existing
OpenBMC dump interfaces support only the dumps generated on the BMC, and the
dump manager doesn't support download operations.

## Glossary

- **System Dump**: A dump of the Host's main memory and processor registers.
  [Read More](https://en.wikipedia.org/wiki/Core_dump)
- **Memory Preserving Reboot (MPR)**: A method of reboot that preserves the
  contents of the volatile memory.
- **PLDM**: An interface and data model to access low-level platform inventory,
  monitoring, control, event, and data/parameters transfer functions.
  [Read More](https://github.com/openbmc/docs/blob/master/designs/pldm-stack.md)
- **Machine Check Exception**: A severe error inside a processor core that
  causes a processor core to stop all processing activities.
- **BMCWeb**: An embedded webserver for OpenBMC.
  [More Info](https://github.com/openbmc/bmcweb/blob/master/README.md)

## Background and References

Various types of dumps are created based on the type and source of failure. The
dump manager, which orchestrates the collection and offload, needs to provide
methods to create the dump, store the dump details, and offload it.
Additionally, some sources allow the dump to be extracted manually without a
failure, to understand the current state or analyze a suspected problem.

### Types of dumps supported

These are some of the dumps supported by the dump manager.

#### BMC Dump

A dump containing various debug information, collected when there is a failure
in the BMC.
This type of dump can also be generated by a user to capture the current state
of the BMC. This dump is collected and stored on the BMC.

#### System Dump

A system dump is a collection of debugging information from the host; this may
include host memory and/or register data. This dump can be initiated by the BMC,
and there can be system reboots while collecting the dump. The dump is either
stored in host memory and offloaded through the BMC, or collected directly to
the BMC, depending on the size of the dump contents and the space available on
the BMC to store the dump.

#### Resource dump

A special type of host dump that is initiated and collected by the host based on
a request from a user. No system state change may be necessary during the
collection of this kind of dump. A resource indicator may be used to indicate
what data is to be collected. The content of the dump can be decided by the
host. This dump can be stored in host memory and offloaded through the BMC, or
the host can send this dump down to the BMC once the collection is completed,
depending on the size of the dump and the availability of space on the BMC.

#### Hostboot dump

A dump that can be collected during a boot failure of the host. This dump may or
may not include the contents of the main memory and/or the processor registers.

#### Hardware dump

This dump can be collected during a critical failure of hardware components,
such as the processor, while the host is booted and running. The host may stop
during this dump, and various processor states and/or memory contents may be
collected to help debug the failure.

## Requirements

![Dump use cases - Users are examples, not a mandatory part of implementation](https://user-images.githubusercontent.com/16666879/70888651-d8f44080-2006-11ea-8596-ed4c321cfaa6.png)

#### Dump manager needs to provide interfaces for

- Create a dump: Initiate the creation of the dump, based on an error condition
  or a user request.
- List the dumps: List all dumps present in the BMC.
- Get a dump: Offload the dump to an external entity.
- Notify: Notify the dump manager that a new dump is created.
- Delete the dump.
- Mark a dump as offloaded to an external entity.
- Set the dump policies, like disabling a type of dump or the dump overwriting
  policy.

## Proposed Design

There are various types of dumps. The interfaces are standard for most of the
dumps, but huge dumps which cannot be stored on the BMC need additional support.
This document explains the design of the different types of dumps. The dumps are
classified based on where they are collected, where they are stored, and how
they are extracted. The two major types are:

- Collected by the BMC and stored on the BMC.
- Collected and stored on an attached entity but offloaded through the BMC.

This proposal focuses on re-using the existing
[phosphor-debug-collector](https://github.com/openbmc/phosphor-debug-collector),
which collects the dumps from the BMC.

![phosphor-debug-collector](https://user-images.githubusercontent.com/16666879/72070844-7b56c980-3310-11ea-8d26-07d33b84b980.jpeg)

#### Brief design points of existing phosphor-debug-collector

- A create interface which assumes the type is BMC dump and returns an ID to the
  caller for user-initiated dumps.
- An external dump request is considered a user-initiated BMC dump and
  initiates BMC dump collection with a tool named dreport, with the type set to
  manual dump.
- dreport creates the dump file in the dump path provided by the dump manager
  code.
- A watch process is forked by the dump manager to catch the signal for the file
  close on that path, and it assumes the dump collection is completed at that
  point.
- The watch process calls an internal D-Bus interface to create the dump entry
  with size and timestamp.
- The path of the dump is based on the predefined base path and the id of the
  dump.
- When an offload request arrives, the file is downloaded from the dump base
  path + id; the entry is not updated to record whether the dump has been
  offloaded.
- A dump is deleted by deleting its entry; the internal file is deleted as well.
- There are system-generated dumps based on an error log or a core dump, which
  work similarly to user-initiated dumps, with the following differences:
  - No external create D-Bus interface is needed, but a monitor application
    monitors for specific conditions, like the existence of a core file or a
    signal for new error log creation of a few selected types.
  - Once the event has occurred, the monitor calls an internal D-Bus interface
    of the dump manager to create the dump with the type of dump.
  - The dump manager calls dreport with the dump type received from the monitor
    and writes data to a path based on the dump id.

#### Updates proposed to the existing phosphor-debug-collector

- The external D-Bus interface needs to specify the type of the dump, since a
  user can request multiple types of dumps.
- Create will return an id which can be mapped to the actual dump once it is
  created.
- A Notify interface is provided for notifying the creation of a dump outside
  the BMC that is offloaded through the BMC.
- The InitiateOffload function will be implemented to download the dump.
- The status of the dump, whether offloaded or not, will be added to the dump
  entry.

### Dump manager interfaces

- The Dump Manager D-Bus object provides interfaces for creating and managing
  dumps.

- Interfaces

  - **Create**: The interface to create a dump, called by clients to initiate a
    user-initiated dump.

    - AdditionalData: The additional data, if any, for initiating the dump. The
      key in this case should be an implementation-specific enum defined for the
      specific type of dump, either in xyz or in a domain. The values can be
      either a string or a 64-bit number. The enum-format string is required to
      come from a parallel class with this specific Enum name. All of the Enum
      strings should be in the format
      'domain.Dump.Create.CreateParameters.ParamName'. e.g.: { "key1": "value1",
      "key2": "value2" } ends up in AdditionalData like: ["KEY1=value1",
      "KEY2=value2"]

  - **Notify**: Notify the dump manager that a new dump is created.
    - ID: ID of the dump; if not 0, this will be the external id of the dump.
    - Type: Type of dump that was created.
    - Size: Size of the dump.

### Dump entry interfaces

- **InitiateOffload**: Initiate the offload of the dump.
  - OffloadUri: The URI where the dump should be offloaded.

#### The properties common to all dumps

There will be a base dump entry with properties common to all types of dumps.

- ID: Id of the dump
- Timestamp: Dump creation timestamp
- Size: Total size of the dump
- OffloadComplete: Set to true when offload is completed
- OffloadURI: The URI for offloading the dump, set while initiating the offload.

Specific types need to inherit from this common dump entry class and add
specific properties.

#### Additional properties based on dump types

##### BMC Dump

- No additional properties

##### System Dump

- External Source ID: ID provided by the Host. This id will be used for all
  communication with the source of the dump, in this case the Host.

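The entry hierarchy and the AdditionalData example above can be illustrated with a minimal Python sketch. It is not the phosphor-debug-collector implementation: the class names, field names, the timestamp unit, and the `flatten_additional_data` helper are all assumptions made for illustration, and the real service exposes these properties over D-Bus rather than as Python objects.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class DumpEntry:
    """Hypothetical base entry with the properties common to all dumps."""
    id: int
    timestamp: int  # creation time; the unit (seconds) is an assumption
    size: int
    offload_complete: bool = False
    offload_uri: str = ""

    def initiate_offload(self, offload_uri: str) -> None:
        # Only records the target URI; the actual transfer is out of scope.
        self.offload_uri = offload_uri


@dataclass
class SystemDumpEntry(DumpEntry):
    """Hypothetical system dump entry; adds the ID provided by the Host."""
    external_source_id: int = 0


def flatten_additional_data(params: Dict[str, Union[str, int]]) -> List[str]:
    """Turn {"key1": "value1"} into ["KEY1=value1"], as in the example above."""
    return [f"{key.upper()}={value}" for key, value in params.items()]
```

For example, `flatten_additional_data({"key1": "value1", "key2": "value2"})` yields the two `KEY=value` strings from the AdditionalData description, and a `SystemDumpEntry` carries the host-provided ID alongside the common properties.
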
### Flow of dumps collected and stored in the Host

PLDM is provided as an example dump transport and notification mechanism between
the Host and the BMC.

- Create: Initiate methods to create the dump in the Host.
- The Host generates the dump.
- The Host notifies the BMC of the creation of the dump through PLDM.
- PLDM calls Notify to create the dump entry.
- InitiateOffload: The dump manager requests the Host to start the offload.
- The Host sends the dump through PLDM, and PLDM on the BMC sends it out.

## Alternatives Considered

- Offloading Host dumps through the Host instead of the BMC was considered, but
  the BMC option was chosen for the following reasons:
  - The BMC is considered the "management path" of most servers, and often the
    Host is not connected to the desired network for the offload location.
  - The BMC provides one common point for all dumps generated in the system for
    an external management appliance.

## Impacts

- The existing BMC dump interface needs to be re-used. The current interface
  does not accept a dump type, so a new interface to create the dump with a type
  will also be provided for BMC dumps, without changing the existing interface.
- Modifying the BMC dump infrastructure to support additional dumps.
- openpower-proc-control will be updated to call memory preserving chip-ops and
  to handle memory preserving reboot on POWER platforms.
- An additional system state to indicate that the system is collecting debug
  data while performing a memory preserving reboot.

## Testing

- Unit tests to make sure the dump manager interfaces are working.
- The following integration tests will be executed to make sure the dump manager
  is working as expected:
  - Test creating host dumps and offloading them.
  - Test deleting host dumps.
  - Create/List/Offload/Delete BMC dumps to make sure existing dump manager
    functions are not broken.
- Automated tests for dump Create/List/Offload/Delete to avoid regression.
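
As a rough illustration of the Create/List/Offload/Delete regression check, the sequence could first be prototyped against an in-memory stand-in. Everything in this sketch is hypothetical: `FakeDumpManager`, its method names, and the example URI are invented for illustration and do not reflect the real D-Bus service or any existing test suite.

```python
from typing import Dict, List


class FakeDumpManager:
    """Hypothetical in-memory stand-in for the dump manager (not the real service)."""

    def __init__(self) -> None:
        self._entries: Dict[int, dict] = {}
        self._next_id = 1

    def create(self, dump_type: str) -> int:
        # Returns an id that maps to the entry, mirroring the Create interface.
        dump_id = self._next_id
        self._next_id += 1
        self._entries[dump_id] = {"type": dump_type, "offloaded": False, "uri": ""}
        return dump_id

    def list_dumps(self) -> List[int]:
        return sorted(self._entries)

    def offload(self, dump_id: int, uri: str) -> None:
        # Marks the entry as offloaded; no data is actually transferred.
        entry = self._entries[dump_id]
        entry["uri"] = uri
        entry["offloaded"] = True

    def is_offloaded(self, dump_id: int) -> bool:
        return self._entries[dump_id]["offloaded"]

    def delete(self, dump_id: int) -> None:
        del self._entries[dump_id]


def run_regression_sequence(manager: FakeDumpManager) -> List[int]:
    """Create, list, offload, and delete one BMC dump; return the remaining ids."""
    dump_id = manager.create("bmc")
    assert dump_id in manager.list_dumps()
    manager.offload(dump_id, "https://example.invalid/dumps")
    assert manager.is_offloaded(dump_id)
    manager.delete(dump_id)
    return manager.list_dumps()
```

Calling `run_regression_sequence(FakeDumpManager())` returns an empty list when every step behaves as expected. A real test would drive the D-Bus or Redfish interfaces instead of this stand-in.
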