# Dump Manager Design

Author: Dhruvaraj Subhashchandran <dhruvaraj@in.ibm.com>

Other contributors:

Created: 12/12/2019

## Problem Description

During a crash or a host failure, an event monitor mechanism generates an error
log, but the size of the error log is limited to a few kilobytes, so the data
from the crash or failure may not all fit into an error log. The additional data
required for debugging needs to be collected as a dump. The existing OpenBMC
dump interfaces support only dumps generated on the BMC, and the dump manager
doesn't support download operations.

## Glossary

- **System Dump**: A dump of the Host's main memory and processor registers.
  [Read More](https://en.wikipedia.org/wiki/Core_dump)
- **Memory Preserving Reboot (MPR)**: A method of rebooting while preserving the
  contents of the volatile memory.
- **PLDM**: An interface and data model to access low-level platform inventory,
  monitoring, control, event, and data/parameters transfer functions.
  [Read More](https://github.com/openbmc/docs/blob/master/designs/pldm-stack.md)
- **Machine Check Exception**: A severe error inside a processor core that
  causes the core to stop all processing activities.
- **BMCWeb**: An embedded webserver for OpenBMC.
  [More Info](https://github.com/openbmc/bmcweb/blob/master/README.md)

## Background and References

Various types of dumps are created based on the type and source of failure. The
dump manager, which orchestrates the collection and offload, needs to provide
methods to create a dump, store its details, and offload it. Additionally, some
sources allow a dump to be extracted manually, without a failure, to understand
the current state or analyze a suspected problem.

### Types of dumps supported

These are some of the dumps supported by the dump manager.

#### BMC Dump

A dump with various debug information, collected when there is a failure in the
BMC. A user can also generate this type of dump to get the current state of the
BMC. This dump is collected and stored on the BMC.

#### System Dump

A system dump is a collection of debugging information from the host; this may
include host memory and/or register data. This dump can be initiated by the BMC,
and the system may reboot while the dump is being collected.
The dump is either stored in host memory and offloaded through the BMC, or
collected directly to the BMC, depending on the size of the dump contents and
the space available on the BMC to store the dump.

#### Resource dump

A special type of host dump, initiated and collected by the host at the request
of a user. A system state change may not be necessary while collecting this kind
of dump. A resource indicator may be used to indicate what data is to be
collected. The content of the dump can be decided by the host. The dump is
either stored in host memory and offloaded through the BMC, or sent by the host
down to the BMC once collection is complete, depending on the size of the dump
and the space available on the BMC.

#### Hostboot dump

A dump that can be collected during a boot failure of the host. This dump may
or may not include the contents of the main memory and/or the processor
registers.

#### Hardware dump

This dump can be collected during a critical failure of a hardware component,
such as the processor, while the host is booted and running.
The host may stop during this dump, which may collect various processor states
and/or memory contents to help debug the failure.

## Requirements

#### Dump manager needs to provide interfaces for

- Create a dump: Initiate the creation of the dump, based on an error condition
  or a user request.
- List the dumps: List all dumps present in the BMC.
- Get a dump: Offload the dump to an external entity.
- Notify: Notify the dump manager that a new dump is created.
- Delete the dump.
- Mark a dump as offloaded to an external entity.
- Set the dump policies, like disabling a type of dump or the dump overwriting
  policy.

## Proposed Design

There are various types of dumps. The interfaces are standard for most of them,
but huge dumps that cannot be stored on the BMC need additional support. This
document explains the design of the different types of dumps. Dumps are
classified based on where they are collected and stored, and how they are
extracted. The two major types are:

- Collected by BMC and stored on BMC.
- Collected and stored on an attached entity but offloaded through BMC.

This proposal focuses on re-using the existing
[phosphor-debug-collector](https://github.com/openbmc/phosphor-debug-collector),
which collects the dumps from the BMC.

#### Brief design points of existing phosphor-debug-collector

- A create interface that assumes the type is BMC dump and returns an ID to the
  caller for user-initiated dumps.
- An external dump request is treated as a user-initiated BMC dump and initiates
  BMC dump collection with a tool named dreport, with the type set to manual
  dump.
- dreport creates the dump file in the dump path provided by the dump manager
  code.
- The dump manager forks a watch process to catch the file-close signal on that
  path and assumes the dump collection is then complete.
- The watch process calls an internal D-Bus interface to create the dump entry
  with size and timestamp.
- The path of the dump is based on the predefined base path and the ID of the
  dump.
- When an offload request comes in, the file is downloaded from the dump base
  path + ID; the entry is not updated to show whether the dump was offloaded.
- A dump is deleted by deleting its entry; the internal file is deleted as well.
- There are system-generated dumps based on an error log or a core dump, which
  work similarly to user-initiated dumps with the following differences.
- No external create D-Bus interface is needed; instead, a monitor application
  watches for specific conditions, such as the existence of a core file or a
  signal for the creation of a new error log of a few selected types.
- Once the event occurs, the monitor calls an internal D-Bus interface of the
  dump manager to create the dump, passing the type of dump.
- The dump manager calls dreport with the dump type received from the monitor
  and writes the data to a path based on the dump ID.

#### Updates proposed to the existing phosphor-debug-collector

- The external D-Bus interface needs to specify the type of the dump, since a
  user can request multiple types of dumps.
- Create will return an ID that can be mapped to the actual dump once it is
  created.
- A Notify interface is provided for notifying the creation of a dump outside
  the BMC that is offloaded through the BMC.
- The InitiateOffload function will be implemented to download the dump.
- The status of the dump, whether offloaded or not, will be added to the dump
  entry.

### Dump manager interfaces

- The Dump Manager D-Bus object provides interfaces for creating and managing
  dumps.

- Interfaces

  - **Create**: The interface to create a dump, called by clients to initiate a
    user-initiated dump.

    - AdditionalData: The additional data, if any, for initiating the dump. The
      key in this case should be an implementation-specific enum defined for the
      specific type of dump, either in xyz or in a domain. The values can be
      either a string or a 64-bit number. The enum-format string is required to
      come from a parallel class with this specific Enum name. All of the Enum
      strings should be in the format
      'domain.Dump.Create.CreateParameters.ParamName'. For example,
      { "key1": "value1", "key2": "value2" } ends up in AdditionalData like:
      ["KEY1=value1", "KEY2=value2"]

  - **Notify**: Notify the dump manager that a new dump is created.
    - ID: ID of the dump; if not 0, this will be the external ID of the dump.
    - Type: Type of dump that was created.
    - Size: Size of the dump.

### Dump entry interfaces

- **InitiateOffload**: Initiate the offload of the dump.
  - OffloadUri: The URI where the dump should be offloaded.

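
The AdditionalData example given for Create above can be sketched in a few lines
of Python. This is only an illustration of the key/value flattening shown in
that example, not the dump manager's code: the function name
`flatten_additional_data` is hypothetical, upper-casing the key is inferred
solely from the "key1" to "KEY1" example, and treating the "64 bit number" as
unsigned is an assumption.

```python
from typing import Dict, List, Union

ParamValue = Union[str, int]
UINT64_MAX = 2**64 - 1  # assumption: "64 bit number" is treated as unsigned


def flatten_additional_data(params: Dict[str, ParamValue]) -> List[str]:
    """Turn {"key1": "value1"} into ["KEY1=value1"], as in the example above."""
    flattened = []
    for key, value in params.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, int)):
            raise TypeError(f"{key}: value must be a string or a 64-bit number")
        if isinstance(value, int) and not 0 <= value <= UINT64_MAX:
            raise ValueError(f"{key}: number does not fit in 64 bits")
        # Assumption: keys are upper-cased, following the KEY1/KEY2 example.
        flattened.append(f"{key.upper()}={value}")
    return flattened


print(flatten_additional_data({"key1": "value1", "key2": "value2"}))
# prints ['KEY1=value1', 'KEY2=value2']
```

Rejecting out-of-range numbers up front keeps the sketch aligned with the
statement that values are either strings or 64-bit numbers.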
#### The properties common to all dumps

There will be a base dump entry with properties common to all types of dumps:

- ID: ID of the dump
- Timestamp: Dump creation timestamp
- Size: Total size of the dump
- OffloadComplete: Set to true when offload is completed
- OffloadURI: The URI for offloading the dump, set while initiating the offload.

Specific types need to inherit from this common dump entry class and add
specific properties.

#### Additional properties based on dump types

##### BMC Dump

- No additional properties

##### System Dump

- External Source ID: ID provided by the Host; this ID will be used for all
  communication with the source of the dump, in this case the Host.

### Flow of dumps collected and stored in the Host

PLDM is provided as an example dump transport and notification mechanism between
Host and BMC.

- Create: Initiate methods to create the dump in the Host.
- The dump is generated in the Host.
- The Host notifies the BMC of the creation of the dump through PLDM.
- PLDM calls Notify to create the dump entry.
- InitiateOffload: The dump manager requests the Host to start the offload.
- The Host sends the dump through PLDM, and PLDM on the BMC sends it out.

## Alternatives Considered

- Offloading Host dumps through the Host instead of the BMC was considered, but
  the BMC option was chosen for the following reasons:
  - The BMC is considered the "management path" of most servers, and often the
    Host is not connected to the desired network for the offload location.
  - The BMC provides one common point for all dumps generated in the system for
    an external management appliance.

## Impacts

- The existing BMC dump interface needs to be re-used. The current interface
  does not accept a dump type, so a new interface to create a dump with a type
  will also be provided for BMC dumps, without changing the existing interface.
- The BMC dump infrastructure will be modified to support additional dumps.
- openpower-proc-control will be updated to call memory preserving chip-ops and
  to handle memory preserving reboot on POWER platforms.
- An additional system state will indicate that the system is collecting debug
  data while performing a memory preserving reboot.

## Testing

- Unit tests to make sure the dump manager interfaces are working.
- The following integration tests will be executed to make sure the dump manager
  is working as expected.
  - Test creating host dumps and offloading them.
  - Test deleting host dumps.
  - Create/List/Offload/Delete BMC dumps to make sure existing dump manager
    functions are not broken.
- Automated tests for dump Create/List/Offload/Delete to avoid regression.
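
As a rough illustration of the Create/List/Offload/Delete regression checks
listed above, the sketch below exercises that sequence against an in-memory
stand-in. Everything here is hypothetical: `FakeDumpManager`, `DumpEntry`, and
their methods are invented for the example and are not the
phosphor-debug-collector or D-Bus API, the offload is assumed to complete
immediately, the Timestamp property is left out, and the URI is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DumpEntry:
    # Fields loosely mirror the common dump entry properties in this document.
    id: int
    dump_type: str
    size: int = 0
    offload_complete: bool = False
    offload_uri: str = ""


class FakeDumpManager:
    """Hypothetical in-memory stand-in, not the real dump manager."""

    def __init__(self) -> None:
        self._entries: Dict[int, DumpEntry] = {}
        self._next_id = 1

    def create(self, dump_type: str) -> int:
        """Create: register a dump of the given type and return its ID."""
        dump_id = self._next_id
        self._next_id += 1
        self._entries[dump_id] = DumpEntry(dump_id, dump_type)
        return dump_id

    def list_dumps(self) -> List[DumpEntry]:
        """List: return all known entries ordered by ID."""
        return sorted(self._entries.values(), key=lambda entry: entry.id)

    def initiate_offload(self, dump_id: int, offload_uri: str) -> None:
        """InitiateOffload: record the URI and mark the entry as offloaded."""
        entry = self._entries[dump_id]
        entry.offload_uri = offload_uri
        # Assumption: the transfer itself is skipped and succeeds at once.
        entry.offload_complete = True

    def delete(self, dump_id: int) -> None:
        """Delete: drop the entry (a real manager would also remove the file)."""
        del self._entries[dump_id]


def test_create_list_offload_delete() -> None:
    manager = FakeDumpManager()
    dump_id = manager.create("BMC")
    assert [entry.id for entry in manager.list_dumps()] == [dump_id]
    manager.initiate_offload(dump_id, "https://example.com/dumps/1")
    assert manager.list_dumps()[0].offload_complete
    manager.delete(dump_id)
    assert manager.list_dumps() == []


test_create_list_offload_delete()
```

A real regression test would drive the actual D-Bus or Redfish interfaces; the
stand-in only shows the order of operations and the state each step is expected
to leave behind.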