# Error handling for power Hardware Abstraction Layer (pHAL)

Author: Devender Rao <devenrao@in.ibm.com> <devenrao>

Other contributors: None

Created: 14/01/2020

## Problem Description

This proposal provides a mechanism to convert the failure data captured during
power Hardware Abstraction Layer (pHAL) library calls to [Platform Event Log][1]
(PEL) format.

## Background and References

OpenBMC applications use the pHAL layer for hardware access and hardware
initialization. Any software or hardware error returned by the pHAL layer needs
to be converted to PEL format for logging the error entry. PELs help improve
firmware and platform serviceability during product development, manufacturing,
and in customer environments.

Error data includes register data and the targets to [guard][2] and call out.
Guard refers to the action of "guarding" faulty hardware from impacting future
system operation. A callout points to specific hardware within the server that
relates to the identified error.

The [phosphor-logging][3] [Create][4] interface is used for creating PELs.

The pHAL layer consists of the libraries below, and these libraries return
different return codes.

1. libipl, used for initial program load
2. libfdt, used for device tree access
3. libekb, used for hardware procedure execution
4. libpdbg, used for hardware access

The proposal is to structure the return data into a standard return code format
so that the caller only has to handle a single return code format for
conversion to PEL.

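One way to picture such a standard return format is a single record that names
the failing library and carries its library-specific code. The following is a
minimal sketch under stated assumptions, not part of this proposal:
`PhalLibrary`, `PhalError`, `libraryName`, `toAdditionalData`, and the key names
are all hypothetical.

```cpp
#include <map>
#include <string>

// Hypothetical: identifies which pHAL library reported the failure.
enum class PhalLibrary { Ipl, Fdt, Ekb, Pdbg };

// Hypothetical unified error record handed back to the caller.
struct PhalError
{
    PhalLibrary library; // library that failed
    int rc;              // library-specific return code
    std::string message; // human-readable description
};

// Map the enum to the library name used in this document.
inline std::string libraryName(PhalLibrary lib)
{
    switch (lib)
    {
        case PhalLibrary::Ipl:
            return "libipl";
        case PhalLibrary::Fdt:
            return "libfdt";
        case PhalLibrary::Ekb:
            return "libekb";
        case PhalLibrary::Pdbg:
            return "libpdbg";
    }
    return "unknown";
}

// Flatten the record into string key/value pairs. The key names are
// illustrative only.
inline std::map<std::string, std::string> toAdditionalData(const PhalError& e)
{
    return {{"LIBRARY", libraryName(e.library)},
            {"RC", std::to_string(e.rc)},
            {"MESSAGE", e.message}};
}
```

With a record like this, the caller needs only one conversion path to PEL data
regardless of which library failed.
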
### Glossary

pHAL: power Hardware Abstraction Layer. pHAL is a group of libraries running on
the BMC. These libraries are used by OpenPOWER-specific applications for
hardware complex interactions, hostboot and Self Boot Engine initialization,
diagnostics, and debugging.

libfdt: Library that pHAL uses to construct the in-memory tree structure of all
targets. [Reference][5]

libpdbg: Library that allows debugging of the host POWER processors from the
BMC. [Reference][6]

MRW: Machine readable workbook. An XML description of a machine as specified by
the system owner.

HWP: Hardware procedure. A "black box" code module supplied by the hardware team
that initializes host processor and memory subsystems in a platform-independent
fashion.

Device Tree: A device tree is a data structure describing the hardware
components of a particular computer so that the operating system's kernel can
use and manage those components, including the CPU or CPUs, the memory, the
buses, and the peripherals. [Reference][7]

EKB: The EKB library contains all the hardware procedures (HWP) for the specific
platform and the corresponding error XML files for each hardware procedure.

PEL: [Platform Event Log][1]

## Requirement

### libekb

The EKB library contains hardware procedures for the specific platform and the
corresponding error XML files for each hardware procedure. The error XML
specifies attribute data, targets to call out, targets to guard, and targets to
deconfigure for a specific error. Parsers in the EKB library parse the error XML
file and generate a C++ header file, which the hardware procedure uses to
capture the failure data.

Add a parser in libekb to parse the error XML file, and provide methods that
parse the failure data returned by the hardware procedure methods and return it
as key-value pairs so that it can be used in the creation of a PEL.

### libipl

The initial program load library is used for booting the system. The library
internally calls hardware procedures (HWP) of the EKB library. The hardware
procedure execution status needs to be returned to the caller so that the caller
can create a PEL on hardware procedure execution failure.

### libpdbg

The libpdbg library is used for hardware access. Any hardware access errors need
to be captured as part of the PEL.

### Message Registry Entries

For errors to be raised in pHAL, corresponding error message registry entries
need to be created in the [message registry][8].

## Proposed design

### Hardware procedure failure

Add a parser in libekb to parse the error XML file, and provide methods that
parse the failure data returned by the hardware procedure methods and return it
as key-value pairs so that it can be passed to the [Create][4] interface for
the creation of a PEL.

Inventory strings for the targets to call out, guard, or deconfigure need to be
added to the additional data section of the Create interface.

Applications need to register callback methods in the libekb library to receive
the error logging traces.

Debug traces returned through the callback method will be added to the PEL.

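The key-value data and inventory strings described in this subsection might be
assembled along the following lines. This is a sketch under stated assumptions:
`TargetAction`, `HwpFailure`, `buildAdditionalData`, and every key name are
hypothetical and are not defined by libekb or by the Create interface. The only
thing taken from this document is that the data is passed as a
string-to-string map.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical: one target named in the error XML, with its requested actions.
struct TargetAction
{
    std::string inventoryPath; // inventory string of the target
    bool callout;
    bool guard;
    bool deconfigure;
};

// Hypothetical: failure record produced by the libekb parser.
struct HwpFailure
{
    std::string rcName;                      // name of the HWP error
    std::map<std::string, std::string> ffdc; // register/attribute data
    std::vector<TargetAction> targets;
};

using AdditionalData = std::map<std::string, std::string>;

// Flatten the failure record into key/value pairs. One entry is added per
// target and per requested action, suffixed with the target's index.
AdditionalData buildAdditionalData(const HwpFailure& failure)
{
    AdditionalData data;
    data["HWP_RC"] = failure.rcName;

    for (const auto& item : failure.ffdc)
    {
        data["FFDC_" + item.first] = item.second;
    }

    for (std::size_t i = 0; i < failure.targets.size(); ++i)
    {
        const TargetAction& target = failure.targets[i];
        const std::string idx = std::to_string(i);
        if (target.callout)
        {
            data["CALLOUT_" + idx] = target.inventoryPath;
        }
        if (target.guard)
        {
            data["GUARD_" + idx] = target.inventoryPath;
        }
        if (target.deconfigure)
        {
            data["DECONFIG_" + idx] = target.inventoryPath;
        }
    }
    return data;
}
```
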
### libipl internal failure

Applications need to register callback methods in the libipl library to receive
the error logging traces.

Debug traces returned through the callback method will be added to the PEL.

### libpdbg internal failure

Applications need to register callback methods in the libpdbg library to
receive the debug traces.

Debug traces returned through the callback method will be added to the PEL.

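The three subsections above share one pattern: the application registers a
callback, the library invokes it for each trace, and the collected traces end up
in the PEL. A minimal sketch of that pattern follows. `TraceSource` is a
stand-in for a pHAL library and `TraceCollector` for the application; both
classes, the callback signature, and the trace line format are assumptions and
do not reflect the actual libekb, libipl, or libpdbg APIs.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical callback signature: a log level plus a formatted message.
using TraceCallback = std::function<void(int level, const std::string& msg)>;

// Stand-in for a pHAL library that lets the application register a callback.
class TraceSource
{
  public:
    void registerCallback(TraceCallback cb)
    {
        callback = std::move(cb);
    }

    // Invoked by the library whenever it produces a trace; does nothing if
    // no callback has been registered.
    void emit(int level, const std::string& msg) const
    {
        if (callback)
        {
            callback(level, msg);
        }
    }

  private:
    TraceCallback callback;
};

// Application-side collector. It must outlive every source it is attached
// to, because the registered lambda captures `this`.
class TraceCollector
{
  public:
    void attach(TraceSource& source, const std::string& name)
    {
        source.registerCallback(
            [this, name](int level, const std::string& msg) {
                traces.push_back(name + "[" + std::to_string(level) +
                                 "]: " + msg);
            });
    }

    // Join all traces, one per line, into a single string that could be
    // stored as one PEL data field.
    std::string joined() const
    {
        std::string out;
        for (const auto& line : traces)
        {
            out += line + "\n";
        }
        return out;
    }

  private:
    std::vector<std::string> traces;
};
```

Attaching one collector to several sources keeps the traces in arrival order,
which preserves the interleaving between libraries during a boot failure.
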
## Sequence diagrams

### Register for debug traces and boot errors

![image](https://user-images.githubusercontent.com/26330444/76838214-e4e7dc80-6859-11ea-818c-031bf5a191d6.png)

### Process debug traces

![image](https://user-images.githubusercontent.com/26330444/76838355-152f7b00-685a-11ea-9975-4091ae1064cc.png)

### Process boot failures

![image](https://user-images.githubusercontent.com/26330444/76838503-3a23ee00-685a-11ea-9f2a-559e233b408f.png)

## Alternatives Considered

None

## Impacts

None

## Future changes

At present, data is passed to the [Create][4] interface in std::map format. This
will change to JSON format once the Create interface supports passing JSON
files.

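To illustrate how the same key-value data could look in both representations,
the sketch below serializes a `std::map` into a flat JSON object. It is an
assumption-laden illustration only: `escapeJson` and `toJson` are hypothetical
helpers, the JSON layout that the Create interface will eventually accept is not
specified in this document, and a real implementation would more likely use an
existing JSON library.

```cpp
#include <map>
#include <string>

// Escape the most common characters that must not appear raw inside a JSON
// string. Other control characters are not handled in this sketch.
std::string escapeJson(const std::string& in)
{
    std::string out;
    for (char c : in)
    {
        switch (c)
        {
            case '"':
                out += "\\\"";
                break;
            case '\\':
                out += "\\\\";
                break;
            case '\n':
                out += "\\n";
                break;
            case '\t':
                out += "\\t";
                break;
            default:
                out += c;
        }
    }
    return out;
}

// Serialize the map as a flat JSON object of string values. Keys appear in
// the map's sorted order.
std::string toJson(const std::map<std::string, std::string>& data)
{
    std::string json = "{";
    bool first = true;
    for (const auto& kv : data)
    {
        if (!first)
        {
            json += ",";
        }
        first = false;
        json += "\"" + escapeJson(kv.first) + "\":\"" +
                escapeJson(kv.second) + "\"";
    }
    json += "}";
    return json;
}
```
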
## Testing

1. Simulate a hardware procedure failure and check that a PEL is created.

[1]:
  https://github.com/openbmc/phosphor-logging/blob/master/extensions/openpower-pels/README.md
[2]:
  https://gerrit.openbmc.org/#/c/openbmc/docs/+/27804/2/designs/gard_on_bmc.md
[3]: https://github.com/openbmc/phosphor-logging
[4]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Logging/Create.interface.yaml
[5]: https://github.com/dgibson/dtc
[6]: https://github.com/open-power/pdbg
[7]: https://elinux.org/Device_Tree_Reference
[8]:
  https://github.com/openbmc/phosphor-logging/blob/master/extensions/openpower-pels/registry/message_registry.json