# Dump Manager Design

Author: Dhruvaraj Subhashchandran <dhruvaraj@in.ibm.com>

Other contributors:

Created: 12/12/2019

## Problem Description

During a crash or a host failure, an event monitor mechanism generates an error
log, but the size of the error log is limited to a few kilobytes, so all the
data from the crash or failure may not fit into an error log. The additional
data required for debugging needs to be collected as a dump. The existing
OpenBMC dump interfaces support only dumps generated on the BMC, and the dump
manager doesn't support download operations.

## Glossary

- **System Dump**: A dump of the Host's main memory and processor registers.
  [Read More](https://en.wikipedia.org/wiki/Core_dump)
- **Memory Preserving Reboot (MPR)**: A method of reboot that preserves the
  contents of the volatile memory.
- **PLDM**: An interface and data model to access low-level platform inventory,
  monitoring, control, event, and data/parameters transfer functions.
  [Read More](https://github.com/openbmc/docs/blob/master/designs/pldm-stack.md)
- **Machine Check Exception**: A severe error inside a processor core that
  causes a processor core to stop all processing activities.
- **BMCWeb**: An embedded webserver for OpenBMC.
  [More Info](https://github.com/openbmc/bmcweb/blob/master/README.md)

## Background and References

Various types of dumps are created based on the type and source of failure. The
dump manager, which orchestrates the collection and offload, needs to provide
methods to create the dump, store the dump details, and offload it.
Additionally, some sources allow the dump to be extracted manually without a
failure, to understand the current state or analyze a suspected problem.

### Types of dumps supported

These are some of the dumps supported by the dump manager.

#### BMC Dump

A dump, containing various debug information, collected when there is a failure
in the BMC. A user can also request this type of dump to capture the current
state of the BMC. This dump is collected and stored on the BMC.

#### System Dump

A system dump is a collection of debugging information from the host; this may
include host memory and/or register data. This dump can be initiated by the BMC,
and the system may reboot while the dump is being collected. The dump is either
stored in host memory and offloaded through the BMC, or collected directly on
the BMC, depending on the size of the dump contents and the space available on
the BMC to store it.

#### Resource dump

A special type of host dump that is initiated and collected by the host based on
a request from a user. A system state change may not be necessary during the
collection of this kind of dump. A resource indicator may be used to indicate
what data is to be collected. The content of the dump can be decided by the
host. This dump can be stored in host memory and offloaded through the BMC, or
the host can send the dump down to the BMC once the collection is completed,
depending on the size of the dump and the space available on the BMC.

#### Hostboot dump

A dump that can be collected during a boot failure of the host. This dump may or
may not include the contents of the main memory and/or the processor registers.

#### Hardware dump

This dump can be collected during a critical failure on hardware components,
like the processor, while the host is booted and running. The host may stop
during this dump, and various processor states and/or memory contents may be
collected to help debug the failure.

## Requirements

![Dump use cases - Users are examples, not a mandatory part of implementation](https://user-images.githubusercontent.com/16666879/70888651-d8f44080-2006-11ea-8596-ed4c321cfaa6.png)

#### Dump manager needs to provide interfaces for

- Create a dump: Initiate the creation of the dump, based on an error condition
  or a user request.
- List the dumps: List all dumps present in the BMC.
- Get a dump: Offload the dump to an external entity.
- Notify: Notify the dump manager that a new dump is created.
- Delete the dump.
- Mark a dump as offloaded to an external entity.
- Set the dump policies, like disabling a type of dump or the dump overwriting
  policy.

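The operations listed above could be modeled, very roughly, as a small in-memory
class. This is only an illustrative Python sketch and not the
phosphor-debug-collector API: `DumpManagerSketch`, `DumpRecord`, the method
names, and the `disabled_types` policy set are all hypothetical, and the actual
transfer behind "Get a dump" is left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class DumpRecord:
    """Hypothetical record kept for each dump known to the BMC."""
    dump_id: int
    dump_type: str
    size: int = 0
    offloaded: bool = False


class DumpManagerSketch:
    """Hypothetical in-memory stand-in for the required interfaces."""

    def __init__(self, disabled_types: Iterable[str] = ()) -> None:
        self.disabled_types: Set[str] = set(disabled_types)  # dump policy
        self._dumps: Dict[int, DumpRecord] = {}
        self._next_id = 1

    def _add(self, dump_type: str, size: int) -> int:
        dump_id = self._next_id
        self._next_id += 1
        self._dumps[dump_id] = DumpRecord(dump_id, dump_type, size)
        return dump_id

    def create(self, dump_type: str) -> int:
        """Create a dump on an error condition or a user request."""
        if dump_type in self.disabled_types:
            raise PermissionError(f"{dump_type} dumps are disabled by policy")
        return self._add(dump_type, size=0)

    def notify(self, dump_type: str, size: int) -> int:
        """Record a dump that was created outside the dump manager."""
        return self._add(dump_type, size)

    def list_dumps(self) -> List[DumpRecord]:
        """List all known dumps, ordered by ID."""
        return sorted(self._dumps.values(), key=lambda d: d.dump_id)

    def mark_offloaded(self, dump_id: int) -> None:
        """Mark a dump as offloaded to an external entity."""
        self._dumps[dump_id].offloaded = True

    def delete(self, dump_id: int) -> None:
        """Delete the dump."""
        del self._dumps[dump_id]
```

In this sketch a disabled dump type is rejected at creation time; where a real
policy check would sit is a design decision the requirements leave open.
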
## Proposed Design

There are various types of dumps. The interfaces are standard for most of them,
but huge dumps that cannot be stored on the BMC need additional support. This
document explains the design of the different types of dumps. Dumps are
classified based on where they are collected and stored and how they are
extracted. The two major types are:

- Collected by BMC and stored on BMC.
- Collected and stored on an attached entity but offloaded through BMC.

This proposal focuses on re-using the existing
[phosphor-debug-collector](https://github.com/openbmc/phosphor-debug-collector),
which collects the dumps from BMC.

![phosphor-debug-collector](https://user-images.githubusercontent.com/16666879/72070844-7b56c980-3310-11ea-8d26-07d33b84b980.jpeg)

#### Brief design points of existing phosphor-debug-collector

- A create interface that assumes the type is BMC dump and returns an ID to the
  caller for user-initiated dumps.
- An external dump request is treated as a user-initiated BMC dump and starts
  BMC dump collection with a tool named dreport, using the manual dump type.
- dreport creates the dump file in the dump path provided by the dump manager
  code.
- The dump manager forks a watch process that catches the file-close signal on
  that path and assumes the dump collection is complete.
- The watch process calls an internal D-Bus interface to create the dump entry
  with size and timestamp.
- The path of the dump is based on the predefined base path and the ID of the
  dump.
- When an offload request arrives, the file is downloaded from the dump base
  path + ID; the entry is not updated to show whether the dump has been
  offloaded.
- A dump is deleted by deleting its entry; the internal file is deleted along
  with it.
- There are system-generated dumps based on an error log or core dump, which
  work like user-initiated dumps with the following differences:
  - No external create D-Bus interface is needed; instead, a monitor application
    watches for specific conditions, like the existence of a core file or a
    signal for the creation of a new error log of a few selected types.
  - Once the event occurs, the monitor calls an internal D-Bus interface of the
    dump manager to create the dump with the type of dump.
  - The dump manager calls dreport with the dump type received from the monitor
    and writes the data to a path based on the dump ID.

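The path and entry handling described in these points can be illustrated with a
short Python sketch. It assumes a hypothetical layout in which each dump lives
in a directory named after its ID under the base path, and it takes the size and
timestamp from the closed file. The real phosphor-debug-collector layout, file
names, and D-Bus calls are not shown; `dump_dir` and `entry_from_file` are
made-up helper names.

```python
import tempfile
from pathlib import Path


def dump_dir(base_path: Path, dump_id: int) -> Path:
    """Derive the dump location from the base path and the dump ID.

    Assumed layout: <base_path>/<dump_id>/
    """
    return base_path / str(dump_id)


def entry_from_file(dump_id: int, dump_file: Path) -> dict:
    """Collect the data for a dump entry once the dump file has been closed."""
    info = dump_file.stat()
    return {
        "id": dump_id,
        "size": info.st_size,              # size in bytes
        "timestamp": int(info.st_mtime),   # seconds since the epoch
    }


def demo() -> dict:
    """Write a placeholder dump file and build its entry data."""
    with tempfile.TemporaryDirectory() as tmp:
        target = dump_dir(Path(tmp), 7)
        target.mkdir(parents=True)
        dump_file = target / "dump.bin"    # placeholder file name
        dump_file.write_bytes(b"\0" * 128)
        return entry_from_file(7, dump_file)
```

Calling `demo()` returns the entry data for a 128-byte placeholder file. In the
design described above, this step would be triggered by the watch process
rather than called directly.
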
#### Updates proposed to the existing phosphor-debug-collector

- The external D-Bus interface needs to specify the type of the dump, since a
  user can request multiple types of dumps.
- Create will return an ID that can be mapped to the actual dump once it is
  created.
- A Notify interface is provided for notifying the creation of a dump that is
  created outside the BMC but offloaded through the BMC.
- The InitiateOffload function will be implemented to download the dump.
- The status of the dump, whether offloaded or not, will be added to the dump
  entry.

### Dump manager interfaces

- The Dump Manager D-Bus object provides interfaces for creating and managing
  dumps.

- Interfaces

  - **Create**: The interface to create a dump, called by clients to initiate a
    user-initiated dump.

    - AdditionalData: The additional data, if any, for initiating the dump. The
      key in this case should be an implementation-specific enum defined for the
      specific type of dump, either in xyz or in a domain. The values can be
      either a string or a 64-bit number. The enum-format string is required to
      come from a parallel class with this specific Enum name. All of the Enum
      strings should be in the format
      'domain.Dump.Create.CreateParameters.ParamName'. For example,
      { "key1": "value1", "key2": "value2" } ends up in AdditionalData as
      ["KEY1=value1", "KEY2=value2"].

  - **Notify**: Notify the dump manager that a new dump is created.
    - ID: ID of the dump; if not 0, this will be the external ID of the dump.
    - Type: Type of dump that was created.
    - Size: Size of the dump.

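The AdditionalData example above can be expressed as a short Python sketch. It
assumes, based only on that example, that each key is upper-cased and joined to
its value with `=`. The function name `to_additional_data` is hypothetical, the
range check treats "64-bit number" as unsigned (an assumption), and real keys
would be the enum strings described above rather than plain words.

```python
from typing import Dict, List, Union


def to_additional_data(params: Dict[str, Union[str, int]]) -> List[str]:
    """Flatten create parameters into "KEY=value" strings.

    Assumption taken from the example in the text: keys are upper-cased.
    Values may be strings or numbers that fit in 64 bits (unsigned assumed).
    """
    items = []
    for key, value in params.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, int)):
            raise TypeError(f"unsupported value type for {key!r}")
        if isinstance(value, int) and not 0 <= value < 2**64:
            raise ValueError(f"{key!r} does not fit in 64 bits")
        items.append(f"{key.upper()}={value}")
    return items


print(to_additional_data({"key1": "value1", "key2": "value2"}))
# prints ['KEY1=value1', 'KEY2=value2']
```
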
### Dump entry interfaces

- **InitiateOffload**: Initiate the offload of the dump.
  - OffloadUri: The URI where the dump should be offloaded.

#### The properties common to all dumps

There will be a base dump entry with properties common to all types of dumps.

- ID: ID of the dump
- Timestamp: Dump creation timestamp
- Size: Total size of the dump
- OffloadComplete: Set to true when offload is completed
- OffloadUri: The URI for offloading the dump, set while initiating the offload.

Specific types need to inherit from this common dump entry class and add
specific properties.

#### Additional properties based on dump types

##### BMC Dump

- No additional properties

##### System Dump

- External Source ID: ID provided by the Host. This ID will be used for all
  communication with the source of the dump, in this case the Host.

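The inheritance relationship between the common dump entry and the type-specific
entries might look like the following Python sketch. The class and field names
are hypothetical renderings of the properties listed above, the units in the
comments are assumptions, and `initiate_offload` only records the URI instead of
transferring anything.

```python
from dataclasses import dataclass


@dataclass
class DumpEntry:
    """Properties common to all dump types, as listed above."""
    id: int
    timestamp: int                 # creation time; the unit is an assumption
    size: int                      # total size; bytes assumed
    offload_complete: bool = False
    offload_uri: str = ""

    def initiate_offload(self, uri: str) -> None:
        """Record the target URI; the actual transfer is out of scope here."""
        self.offload_uri = uri


@dataclass
class BmcDumpEntry(DumpEntry):
    """BMC dumps add no properties of their own."""


@dataclass
class SystemDumpEntry(DumpEntry):
    """System dumps also carry the ID assigned by the Host."""
    external_source_id: int = 0
```

A system dump entry would then be created with, for example,
`SystemDumpEntry(id=1, timestamp=1576108800, size=2048, external_source_id=42)`,
where all of the values are placeholders.
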
### Flow of dumps collected and stored in the Host

PLDM is provided as an example dump transport and notification mechanism between
Host and BMC.

- Create: Initiate methods to create the dump in the Host.
- The Host generates the dump.
- The Host notifies the BMC of the dump creation through PLDM.
- PLDM calls Notify to create the dump entry.
- InitiateOffload: The dump manager requests the Host to start the offload.
- The Host sends the dump through PLDM, and PLDM on the BMC sends it out.

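The sequence above can be traced with a toy Python model. Everything in it is
hypothetical: `DumpManagerModel` and `HostModel` are invented names, the
transport (PLDM in the text) is reduced to direct method calls, and the dump
payload is a placeholder byte string. It only illustrates the order of the
steps.

```python
from typing import Dict, List


class DumpManagerModel:
    """Hypothetical BMC-side model: keeps entries and a log of steps."""

    def __init__(self, host: "HostModel", log: List[str]) -> None:
        self.host = host
        self.log = log
        self.entries: Dict[int, dict] = {}
        self._next_id = 1

    def create(self) -> None:
        """Ask the host to create a dump."""
        self.log.append("create requested")
        self.host.generate_dump(self)

    def notify(self, source_id: int, size: int) -> int:
        """Called on behalf of the host to create the dump entry."""
        entry_id = self._next_id
        self._next_id += 1
        self.entries[entry_id] = {
            "source_id": source_id,
            "size": size,
            "offloaded": False,
        }
        self.log.append("entry created")
        return entry_id

    def initiate_offload(self, entry_id: int) -> bytes:
        """Request the dump from the host and pass it on."""
        entry = self.entries[entry_id]
        self.log.append("offload requested")
        data = self.host.send_dump(entry["source_id"])
        entry["offloaded"] = True
        self.log.append("dump sent out")
        return data


class HostModel:
    """Hypothetical host that stores dumps in its own memory."""

    def __init__(self) -> None:
        self.dumps: Dict[int, bytes] = {}

    def generate_dump(self, manager: DumpManagerModel) -> None:
        source_id = len(self.dumps) + 1
        self.dumps[source_id] = b"host-memory-image"  # placeholder payload
        manager.log.append("host generated dump")
        manager.notify(source_id, len(self.dumps[source_id]))

    def send_dump(self, source_id: int) -> bytes:
        return self.dumps[source_id]
```

Calling `create()` and then `initiate_offload(1)` on a fresh model appends the
log entries in the same order as the steps listed above.
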
## Alternatives Considered

- Offloading Host dumps through the Host instead of the BMC was considered, but
  the BMC option was chosen for the following reasons:
  - The BMC is considered the "management path" of most servers, and often the
    Host is not connected to the desired network for the offload location.
  - The BMC provides one common point for all dumps generated in the system for
    an external management appliance.

## Impacts

- The existing BMC dump interface needs to be re-used. The current interface
  does not accept a dump type, so a new interface to create the dump with a type
  will also be provided for BMC dumps, without changing the existing interface.
- Modifying the BMC dump infrastructure to support additional dumps.
- openpower-proc-control will be updated to call memory preserving chip-ops and
  to handle memory preserving reboot on POWER platforms.
- An additional system state to indicate that the system is collecting debug
  data while performing a memory preserving reboot.

## Testing

- Unit tests to make sure the dump manager interfaces are working.
- The following integration tests will be executed to make sure the dump manager
  is working as expected:
  - Test creating host dumps and offloading them.
  - Test deleting host dumps.
  - Create/List/Offload/Delete BMC dumps to make sure existing dump manager
    functions are not broken.
- Automated tests for dump Create/List/Offload/Delete to avoid regression.