# Dump Manager Design

Author:
  Dhruvaraj Subhashchandran <dhruvaraj@in.ibm.com>

Other contributors:

Created: 12/12/2019

## Problem Description
During a crash or a host failure, an event monitor mechanism generates an error
log, but the size of the error log is limited to a few kilobytes, so all the data
from the crash or failure may not fit into an error log. The additional data
required for debugging needs to be collected as a dump.
The existing OpenBMC dump interfaces support only dumps generated on the BMC,
and the dump manager doesn't support download operations.

## Glossary

- **System Dump**: A dump of the Host's main memory and processor registers.
    [Read More](https://en.wikipedia.org/wiki/Core_dump)
- **Memory Preserving Reboot (MPR)**: A method of rebooting that preserves the
    contents of the volatile memory.
- **PLDM**: An interface and data model to access low-level platform inventory,
    monitoring, control, event, and data/parameters transfer functions.
    [Read More](https://github.com/openbmc/docs/blob/master/designs/pldm-stack.md)
- **Machine Check Exception**: A severe error inside a processor core that
    causes the processor core to stop all processing activities.
- **BMCWeb**: An embedded webserver for OpenBMC.
    [More Info](https://github.com/openbmc/bmcweb/blob/master/README.md)

## Background and References
Various types of dumps are created based on the type and source of failure.
The dump manager, which orchestrates the collection and offload, needs to
provide methods to create a dump, store its details, and offload it.
Additionally, some sources allow a dump to be extracted manually, without a
failure, to understand the current state or analyze a suspected problem.

### Types of dumps supported
These are some of the dumps supported by the dump manager.

#### BMC Dump
A dump, containing various debug information, collected when there is a failure
in the BMC. This type of dump can also be generated by a user to get the current
state of the BMC. This dump is collected on the BMC and stored on the BMC.

#### System Dump
A system dump is a collection of debugging information from the host; this may
include host memory and/or register data. This dump can be initiated by the BMC,
and there can be system reboots while collecting the dump. The dump is either
stored in the host memory and offloaded through the BMC, or collected directly
to the BMC, based on the size of the dump contents and the available space on
the BMC to store the dump.

#### Resource dump
A special type of host dump that is initiated and collected by the host based on
a request from a user. No system state change may be necessary during the
collection of this kind of dump. A resource indicator may be used to indicate
what data is to be collected. The content of the dump can be decided by the
host. This dump can be stored in host memory and offloaded through the BMC, or
the host can send this dump down to the BMC once the collection is completed,
based on the size of the dump and the availability of space on the BMC.

#### Hostboot dump
A dump that can be collected during a boot failure of the host. This dump may
or may not include the contents of the main memory and/or the processor
registers.

#### Hardware dump
This dump can be collected during a critical failure of hardware components
like the processor while the host is booted and running. The host may stop
during this dump and may collect various processor states and/or memory
contents to help debug the failure.


## Requirements

![Dump use cases - Users are examples, not a mandatory part of implementation](https://user-images.githubusercontent.com/16666879/70888651-d8f44080-2006-11ea-8596-ed4c321cfaa6.png)

#### Dump manager needs to provide interfaces to
- Create a dump: Initiate the creation of a dump, based on an error condition
  or a user request.
- List the dumps: List all dumps present in the BMC.
- Get a dump: Offload the dump to an external entity.
- Notify: Notify the dump manager that a new dump is created.
- Delete the dump.
- Mark a dump as offloaded to an external entity.
- Set the dump policies, like disabling a type of dump or the dump overwriting
  policy.

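
The operations listed above can be pictured with a small in-memory model. This
is only a sketch for illustration: the class name, method names, and the
string dump types are hypothetical and do not correspond to the actual D-Bus
interfaces, and the only policy shown is disabling a dump type.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class DumpRecord:
    """Hypothetical record; the field names are illustrative only."""
    dump_id: int
    dump_type: str
    offloaded: bool = False


class InMemoryDumpManager:
    """Toy stand-in for the listed operations (not the real D-Bus API)."""

    def __init__(self):
        self._ids = count(1)          # source of new dump IDs
        self._dumps = {}              # dump ID -> DumpRecord
        self._disabled_types = set()  # policy: types that may not be created

    def set_type_enabled(self, dump_type, enabled):
        """Policy hook: enable or disable creation of one dump type."""
        if enabled:
            self._disabled_types.discard(dump_type)
        else:
            self._disabled_types.add(dump_type)

    def create(self, dump_type):
        """Create a dump of the given type and return its ID."""
        if dump_type in self._disabled_types:
            raise PermissionError(f"{dump_type} dumps are disabled by policy")
        record = DumpRecord(next(self._ids), dump_type)
        self._dumps[record.dump_id] = record
        return record.dump_id

    def list_dumps(self):
        """Return the IDs of all dumps currently known."""
        return sorted(self._dumps)

    def mark_offloaded(self, dump_id):
        self._dumps[dump_id].offloaded = True

    def is_offloaded(self, dump_id):
        return self._dumps[dump_id].offloaded

    def delete(self, dump_id):
        del self._dumps[dump_id]
```

A caller would create a dump, keep the returned ID, and use that ID for the
later mark-offloaded and delete operations.
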
## Proposed Design
There are various types of dumps; interfaces are standard for most of the dumps,
but huge dumps which cannot be stored on the BMC need additional support.
This document explains the design of the different types of dumps. The dumps are
classified based on where they are collected and stored, and how they are
extracted. The two major types are:

- Collected by BMC and stored on BMC.
- Collected and stored on an attached entity but offloaded through BMC.

This proposal focuses on re-using the existing
[phosphor-debug-collector](https://github.com/openbmc/phosphor-debug-collector),
which collects the dumps from BMC.

![phosphor-debug-collector](https://user-images.githubusercontent.com/16666879/72070844-7b56c980-3310-11ea-8d26-07d33b84b980.jpeg)

#### Brief design points of existing phosphor-debug-collector
- A create interface which assumes the type is BMC dump and returns an ID to the
  caller for user-initiated dumps.
- An external dump request is considered a user-initiated BMC dump and
  initiates BMC dump collection with a tool named dreport, with type manual
  dump.
- dreport creates the dump file in the dump path provided by the dump manager
  code.
- A watch process is forked by the dump manager to catch the signal for the file
  close on that path, at which point the dump collection is assumed to be
  complete.
- The watch process calls an internal D-Bus interface to create the dump entry
  with size and timestamp.
- The path of the dump is based on the predefined base path and the ID of the
  dump.
- When a request comes for offload, the file is downloaded from the dump base
  path + ID; the entry is not updated to record whether the dump was offloaded.
- A dump is deleted by deleting its entry; the internal file is deleted as well.
- There are system-generated dumps based on an error log or core dump, which
  work similarly to user-initiated dumps with the following differences:
    - No external create D-Bus interface is needed, but a monitor application
      will monitor for specific conditions like the existence of a core file or
      a signal for new error log creation of a few selected types.
    - Once the event occurs, the monitor will call an internal D-Bus interface
      of the dump manager to create the dump with the type of dump.
    - The dump manager calls dreport with the dump type received from the
      monitor and writes data to a path based on the dump ID.

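
As a rough illustration of two of the points above (the dump path being
derived from a base path plus the dump ID, and the entry carrying size and
timestamp once the file is closed), a minimal sketch follows. The function
names are hypothetical, and using the file's modification time as the
timestamp is an assumption made for this example; the file-close watch itself
is not modelled.

```python
from pathlib import Path


def dump_path(base_path, dump_id):
    """Location of a dump: the predefined base path joined with the dump ID."""
    return Path(base_path) / str(dump_id)


def entry_details(dump_file):
    """Size and timestamp for a finished dump file.

    Assumption: the file's modification time stands in for the creation
    timestamp that the watch process would report.
    """
    info = Path(dump_file).stat()
    return {"size": info.st_size, "timestamp": int(info.st_mtime)}
```

Because the path is a pure function of the ID, an offload request only needs
the ID to locate the file again.
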
#### Updates proposed to the existing phosphor-debug-collector
- The external D-Bus interface needs to specify the type of the dump, since a
  user can request multiple types of dumps.
- Create will return an ID which can be mapped to the actual dump once it is
  created.
- A Notify interface is provided for notifying the dump manager about a dump
  that is created outside the BMC but offloaded through the BMC.
- The InitiateOffload function will be implemented to download the dump.
- The status of the dump, whether offloaded or not, will be added to the dump
  entry.

### Dump manager interfaces
- The Dump Manager D-Bus object provides interfaces for creating and managing
  dumps.

- Interfaces
    - **Create**: The interface to create a dump, called by clients to initiate
      a user-initiated dump.
        - AdditionalData: The additional data, if any, for initiating the dump.
            The key in this case should be an implementation-specific enum
            defined for the specific type of dump, either in xyz or in a domain.
            The values can be either a string or a 64-bit number.
            The enum-format string is required to come from a parallel class
            with this specific Enum name. All of the Enum strings should be in
            the format
            'domain.Dump.Create.CreateParameters.ParamName'.
            e.g.:
              {
                "key1": "value1",
                "key2": "value2"
              }
            ends up in AdditionalData like:
              ["KEY1=value1", "KEY2=value2"]

    - **Notify**: Notify the dump manager that a new dump is created.
        - ID: ID of the dump; if not 0, this will be the external ID of the dump.
        - Type: Type of dump that was created.
        - Size: Size of the dump.

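
The AdditionalData example given for Create can be mirrored by a small helper.
This is a sketch only: the function name is hypothetical, the upper-casing of
keys simply copies what the example shows, and treating a "64-bit number" as
an unsigned value is an assumption.

```python
def to_additional_data(params):
    """Flatten key/value parameters into 'KEY=value' strings.

    Keys are upper-cased to mirror the example in the text; values must be
    a string or an integer that fits in 64 bits (assumed unsigned here).
    """
    items = []
    for key, value in params.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, int)):
            raise TypeError(f"unsupported value type for {key!r}")
        if isinstance(value, int) and not 0 <= value < 2**64:
            raise ValueError(f"value for {key!r} does not fit in 64 bits")
        items.append(f"{key.upper()}={value}")
    return items
```

Calling `to_additional_data({"key1": "value1", "key2": "value2"})` yields the
list shown in the example above.
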
### Dump entry interfaces
- **InitiateOffload**: Initiate the offload of the dump.
    - OffloadUri: The URI where the dump should be offloaded.

#### The properties common to all dumps
There will be a base dump entry with properties common to all types of dumps:

- ID: ID of the dump
- Timestamp: Dump creation timestamp
- Size: Total size of the dump
- OffloadComplete: Set to true when offload is completed
- OffloadURI: The URI for offloading the dump, set while initiating the offload.

Specific types need to inherit from this common dump entry class
and add specific properties.

#### Additional properties based on dump types

##### BMC Dump
- No additional properties

##### System Dump
- External Source ID: ID provided by the Host; this ID will be used for all
  communication with the source of the dump, in this case the Host.


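
The relationship between the common entry and the type-specific entries can be
sketched with plain data classes. The class and field names below are
hypothetical renderings of the listed properties, and the integer types
(including the timestamp unit) are assumptions; the actual entries are D-Bus
objects.

```python
from dataclasses import dataclass


@dataclass
class DumpEntry:
    """Properties common to all dump types, as listed above."""
    id: int
    timestamp: int                  # creation time; unit is an assumption
    size: int                       # total size of the dump
    offload_complete: bool = False  # set to True once offload has finished
    offload_uri: str = ""           # set when the offload is initiated


@dataclass
class BmcDumpEntry(DumpEntry):
    """BMC dump: no additional properties."""


@dataclass
class SystemDumpEntry(DumpEntry):
    """System dump: adds the ID the Host uses to refer to this dump."""
    external_source_id: int = 0
```

Code that only needs the common properties can accept any `DumpEntry`, while
code that talks to the Host would use `external_source_id` from the system
dump entry.
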
### Flow of dumps collected and stored in the Host
PLDM is provided as an example dump transport and notification mechanism
between Host and BMC.

- Create: Initiate the creation of the dump in the Host.
- The Host generates the dump.
- The Host notifies the BMC of the creation of the dump through PLDM.
- PLDM calls Notify to create the dump entry.
- InitiateOffload: The dump manager requests the Host to start the offload.
- The Host sends the dump through PLDM, and PLDM on the BMC sends it out.


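
The flow above can be walked through with a small simulation. Everything here
is a hypothetical stand-in: `FakeHost` replaces the Host and its PLDM
transport, the entry is a plain dictionary, and the Host's notification is
collapsed into a direct call, whereas in the described flow it would arrive
asynchronously through PLDM.

```python
class FakeHost:
    """Stand-in for the Host side; a real transport would be PLDM or similar."""

    def __init__(self):
        self._next_id = 0x100  # arbitrary starting ID for the sketch
        self._dumps = {}       # source ID -> dump contents

    def generate_dump(self, payload):
        """Store a dump in 'host memory' and return (source ID, size)."""
        source_id = self._next_id
        self._next_id += 1
        self._dumps[source_id] = payload
        return source_id, len(payload)

    def send_dump(self, source_id):
        """Hand the dump contents back for offload."""
        return self._dumps[source_id]


class DumpManagerSketch:
    """Records the order of the calls named in the flow above."""

    def __init__(self, host):
        self.host = host
        self.entries = {}  # external source ID -> entry properties
        self.log = []

    def create(self, payload):
        self.log.append("Create")
        source_id, size = self.host.generate_dump(payload)
        # Simplification: the Host's notification is delivered synchronously.
        self.notify(source_id, "system", size)
        return source_id

    def notify(self, source_id, dump_type, size):
        self.log.append("Notify")
        self.entries[source_id] = {
            "type": dump_type,
            "size": size,
            "offload_complete": False,
            "offload_uri": "",
        }

    def initiate_offload(self, source_id, offload_uri):
        self.log.append("InitiateOffload")
        entry = self.entries[source_id]
        entry["offload_uri"] = offload_uri
        data = self.host.send_dump(source_id)
        entry["offload_complete"] = True
        return data
```

The entry only exists after Notify has run, which is why the offload step
looks the dump up by the ID that the Host supplied.
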
## Alternatives Considered
- Offloading Host dumps through the Host instead of the BMC was considered, but
  the BMC option was chosen for the following reasons:
    - The BMC is considered the "management path" of most servers, and often
      the Host is not connected to the desired network for the offload
      location.
    - The BMC provides one common point for all dumps generated in the system
      for an external management appliance.

## Impacts
- The existing BMC dump interface needs to be re-used. The current interface
  does not accept a dump type, so a new interface to create a dump with a type
  will also be provided for BMC dumps, without changing the existing interface.
- Modifying the BMC dump infrastructure to support additional dumps.
- openpower-proc-control will be updated to call memory preserving chip-ops and
  to handle memory preserving reboot on POWER platforms.
- An additional system state to indicate that the system is collecting debug
  data while performing a memory preserving reboot.


## Testing
- Unit tests to make sure the dump manager interfaces are working.
- The following integration tests will be executed to make sure the dump manager
  is working as expected:
    - Test creating host dumps and offloading them.
    - Test deleting host dumps.
    - Create/List/Offload/Delete BMC dumps to make sure existing
      dump manager functions are not broken.
- Automated tests for dump Create/List/Offload/Delete to avoid regression.