# Intel IPMI Platform Events parsing

In many cases manufacturer-specific IPMI Platform Events are stored in binary
form in the System Event Log, which makes it difficult to understand the
platform state. This document specifies a solution for presenting
manufacturer-specific IPMI Platform Events in a human-readable form by defining
a generic framework for parsing and defining new messages in an easy and
scalable way. Events originating from the Intel Management Engine (ME) are used
as a case study. The general design of the solution is followed by a detailed
description of a tailored-down implementation for OpenBMC.

## Glossary

- **IPMI** - Intelligent Platform Management Interface; a standardized binary
  protocol for communication between endpoints in a datacenter `[1]`
- **Platform Event** - a specific type of IPMI binary payload, used for encoding
  and sending asynchronous one-way messages to a recipient `[1]-29.3`
- **ME** - Intel Management Engine, an autonomous subsystem used for remote
  datacenter management `[5]`
- **Redfish** - a modern datacenter management protocol, built around REST and
  the JSON format `[2]`
- **OpenBMC** - an open-source BMC implementation with a Redfish-oriented
  interface `[3]`

## Problem statement

IPMI is designed to be a compact and efficient binary format for data exchanged
between entities in a datacenter. The recipient is responsible for receiving the
data, then analyzing, parsing and translating the binary representation into a
human-readable format. IPMI Platform Events are one type of these messages, used
to inform the recipient about the occurrence of a particular, well-defined
situation.

Some IPMI Platform Events are standardized, described in the specification, and
already have an open-source implementation `[6]`; however, this covers only part
of the spectrum. The increasing complexity of datacenter systems has multiplied
the possible sources of events, which are defined by manufacturer-specific
extensions to platform event data. One of these sources is Intel ME, which is
able to deliver information about its own state of operation and in some cases
notify about certain erroneous system-wide conditions, such as interface errors.

These OEM-specific messages lack support in existing open-source
implementations. They require a manual, documentation-based `[5]`
implementation, which has historically been the source of many interpretation
errors. Any document update requires manual code modification to match the
specific changes, which is neither efficient nor scalable. Furthermore, the
documentation is not always clear on event severity or possible resolution
actions.

## Solution

A generic, OEM-agnostic algorithm is proposed to produce human-readable output
for a binary IPMI Platform Event.

In general, each event consists of a predefined payload:

```ascii
[GeneratorID][SensorNumber][EventType][EventData[2]]
```

where:

- `GeneratorID` - used to determine the source of the event,
- `SensorNumber` - generator-specific unique sensor number,
- `EventType` - sensor-specific group of events,
- `EventData` - array with detailed event data.
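The payload layout above can be sketched as a small data structure. This is
only an illustration: `PlatformEvent` and `unpackPlatformEvent` are hypothetical
names, and the sketch assumes the simplified one-byte-per-field layout shown
above rather than the full IPMI request format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical in-memory view of the simplified payload layout shown above.
struct PlatformEvent
{
    uint8_t generatorId;            // source of the event
    uint8_t sensorNumber;           // generator-specific sensor number
    uint8_t eventType;              // sensor-specific group of events
    std::vector<uint8_t> eventData; // remaining bytes with event details
};

// Splits a raw buffer laid out as
// [GeneratorID][SensorNumber][EventType][EventData...].
// Returns std::nullopt when the three leading bytes are not all present.
std::optional<PlatformEvent>
    unpackPlatformEvent(const std::vector<uint8_t>& raw)
{
    constexpr std::size_t headerSize = 3;
    if (raw.size() < headerSize)
    {
        return std::nullopt;
    }
    return PlatformEvent{
        raw[0], raw[1], raw[2],
        std::vector<uint8_t>(raw.begin() + headerSize, raw.end())};
}
```

Keeping `eventData` as a variable-length vector avoids hard-coding the number of
data bytes in this sketch.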

Each consecutive event field narrows down the domain of event interpretations,
starting with `GeneratorID` at the top and ending with `EventData` at the end of
a `decision tree`. Software should be able to determine the meaning of the event
by applying the `divide and conquer` approach to a predefined list of well-known
event definitions. Note that such a decision tree might also be needed to break
down `EventData` itself, as is the case in many OEM-specific IPMI
implementations.

The implementation should therefore be a series of filters with increasing
specialization on each level. A recursive algorithm for this looks like the
following:

```ascii
     +-------------+           +*Step 1*               +
     | +---------+ |           |                       |
     | |Currently| |           |Analyze and choose     |
+----> |analyzed +------------>+proper 'subtree' parser|
|    | |chunk    | |           |                       |
|    | +---------+ |           +                       +         +---------+
|    | +---------+ |                                             |Remainder|
|    | |Remainder| |                                             |         |
|    | |         | |           +*Step 2*               +         |         |
|    | |         | |           |                       |         |         |
|    | |         +---------------------------------------------->+         +---+
|    | |         | |           |'Cut' the remainder    |         |         |   |
|    | |         | |           |and go back to Step 1  |         |         |   |
|    | |         | |           +                       +         |         |   |
|    | +---------+ |                                             |         |   |
|    +-------------+                                             +---------+   |
|                                                                              |
|                                                                              |
+------------------------------------------------------------------------------+
```

The described process is repeated until there is nothing left to break down and
a single, unique event interpretation (an `EventId`) has been determined.

Not all event data is a decision point: certain chunks of data should be kept
as-is or formatted in a certain way, so that they can be included in the
human-readable `Message`. Parser operation should therefore also include logic
for extracting `Parameters` during the traversal process.
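A minimal sketch of this traversal is given here, assuming a tree in which each
node is selected by the value of one chunk and every byte left over at a leaf
becomes a decimal `Parameter`. The `Node`, `ParsedEvent`, `parseEvent` and
`exampleTree` names are hypothetical and exist only for this illustration; the
byte values are the ones used in the example of this document.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct ParsedEvent
{
    std::string eventId;
    std::vector<std::string> parameters;
};

// Hypothetical tree node: 'key' is the chunk value that selects this node.
// A node without children is a leaf and carries the final EventId.
struct Node
{
    uint8_t key;
    std::string eventId;
    std::vector<Node> children;
};

// Step 1: match the currently analyzed chunk against the nodes on this level.
// Step 2: 'cut' the remainder and descend, until a leaf is reached.
std::optional<ParsedEvent> parseEvent(const std::vector<Node>& level,
                                      const std::vector<uint8_t>& data,
                                      std::size_t offset = 0)
{
    if (offset >= data.size())
    {
        return std::nullopt; // ran out of bytes before reaching a leaf
    }
    for (const Node& node : level)
    {
        if (node.key != data[offset])
        {
            continue;
        }
        if (!node.children.empty())
        {
            return parseEvent(node.children, data, offset + 1);
        }
        // Leaf: bytes that were not decision points become Parameters,
        // here simply formatted as decimal numbers.
        ParsedEvent event{node.eventId, {}};
        for (std::size_t i = offset + 1; i < data.size(); ++i)
        {
            event.parameters.push_back(std::to_string(data[i]));
        }
        return event;
    }
    return std::nullopt; // no known interpretation
}

// Tree holding the single event used in the example of this document.
std::vector<Node> exampleTree()
{
    Node flashWearout{0x0A, "FlashWearoutWarning", {}};
    Node fwStatus{0x00, "", {flashWearout}};
    Node meHealth{0x17, "", {fwStatus}};
    Node me{0x2C, "", {meHealth}};
    return {me};
}
```

Calling `parseEvent(exampleTree(), {0x2C, 0x17, 0x00, 0x0A, 0x32})` walks the
four decision bytes and leaves `0x32` as the only parameter.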

Effectively, both the `EventId` and an optional collection of `Parameters`
should then be used as input for a lookup mechanism that generates the final
`Event Message`. Each message consists of the following entries:

- `EventId` - associated unique event,
- `Severity` - determines how severely this particular event might affect usual
  datacenter operation,
- `Resolution` - suggested steps to mitigate a possible problem,
- `Message` - human-readable message, possibly with predefined placeholders for
  `Parameters`.
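The lookup step can be illustrated with a small sketch. It assumes that
placeholders are written as `%1`, `%2`, and so on, and that the registry is a
plain map keyed by `EventId`; `MessageEntry`, `fillPlaceholders` and
`renderMessage` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// One registry entry, mirroring the fields listed above.
struct MessageEntry
{
    std::string severity;
    std::string resolution;
    std::string message;
};

// Replaces %1, %2, ... with the matching Parameters. Higher indices are
// substituted first so that "%10" is not mistaken for "%1" followed by "0".
std::string fillPlaceholders(std::string message,
                             const std::vector<std::string>& parameters)
{
    for (std::size_t i = parameters.size(); i-- > 0;)
    {
        const std::string placeholder = "%" + std::to_string(i + 1);
        std::size_t pos = 0;
        while ((pos = message.find(placeholder, pos)) != std::string::npos)
        {
            message.replace(pos, placeholder.size(), parameters[i]);
            pos += parameters[i].size();
        }
    }
    return message;
}

// Looks up the EventId and returns the final text, or std::nullopt when the
// registry has no entry for it.
std::optional<std::string>
    renderMessage(const std::map<std::string, MessageEntry>& registry,
                  const std::string& eventId,
                  const std::vector<std::string>& parameters)
{
    const auto entry = registry.find(eventId);
    if (entry == registry.end())
    {
        return std::nullopt;
    }
    return fillPlaceholders(entry->second.message, parameters);
}
```

Keeping the text in a registry separate from the parser means that wording,
severity and resolution can change without touching the parsing code.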

### Example

An example of such a message parsing process is shown below:

```ascii
        +-------------+
        |[GeneratorId]|
        |0x2C (ME)    |
        +------+------+
               |
        +------v---------+
        |[SensorNumber]  |
. . . . |0x17 (ME Health)|
        +------+---------+
               |
        +------v---------+
        |[EventType]     |
. . . . |0x00 (FW Status)|
        +------+---------+
               |
        +------v-------------------+
        |[EventData[0]]            |           +-------------------------------------------+
. . . . |0x0A (FlashWearoutWarning)+------+    |ParsedEvent|                               |
        +------+-------------------+      |    +-----------+                               |
               |                          +---->'EventId' = FlashWearoutWarning            |
        +------v----------+               +---->'Parameters' = [ toDecimal(EventData[1]) ] |
        |[EventData[1]]   |               |    |                                           |
        |0x## (Percentage)+---------------+    +-------------------------------------------+
        +-----------------+
```

The resulting `ParsedEvent` can then be passed to a lookup mechanism, which
contains human-readable information for each `EventId`:

```ascii
+------------------------------------------------+
|+------------------------------------------------+
||+------------------------------------------------+
||| EventId: FlashWearoutWarning                   |
||| Severity: Warning                              |
||| Resolution: No immediate repair action needed  |
||| Message: Warning threshold for number of flash |
|||          operations has been exceeded. Current |
|||          percentage of write operations        |
+||          capacity: %1                          |
 +|                                                |
  +------------------------------------------------+
```

## Solution in OpenBMC

The proposed algorithm is delivered as part of the open-source OpenBMC project
`[3]`. As this software stack is built with a micro-service architecture in
mind, the implementation had to be divided into multiple parts:

- IPMI Platform Event payload unpacking (`[7]`)
  - `openbmc/intel-ipmi-oem/src/sensorcommands.cpp`
  - `openbmc/intel-ipmi-oem/src/ipmi_to_redfish_hooks.cpp`
- Intel ME event parsing
  - `openbmc/intel-ipmi-oem/src/me_to_redfish_hooks.cpp`
- Detected events storage (`[4]`)
  - `systemd journal`
- Human-readable message lookup (`[2], [8]`)
  - `MessageRegistry in bmcweb`
    - `openbmc/bmcweb/redfish-core/include/registries/openbmc_message_registry.hpp`

### OpenBMC flow

#### Event arrival

1. The IPMI driver notifies `intel-ipmi-oem` about an incoming `Platform Event`
   (NetFn=0x4, Cmd=0x2)
   - The matching command handler in `intel-ipmi-oem/src/sensorcommands.cpp`
     is invoked
2. The message is forwarded to `intel-ipmi-oem/src/ipmi_to_redfish_hooks.cpp`
   as a call to `sel::checkRedfishHooks`
   - `sel::checkRedfishHooks` analyzes the data; `BIOS` events are handled
     in-place, while `ME` events are delegated to
     `intel-ipmi-oem/src/me_to_redfish_hooks.cpp`
3. `me::messageHook` is called with the payload. The parsing algorithm
   determines the final `EventId` and `Parameters`
   - `me::utils::storeRedfishEvent(EventId, Parameters)` is called, which
     stores the event securely in the `system journal`
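The body of `me::utils::storeRedfishEvent` is not reproduced in this document.
As a rough sketch of what a journal entry for such an event might carry, the
hypothetical helper below composes two fields. It assumes the
`REDFISH_MESSAGE_ID` and `REDFISH_MESSAGE_ARGS` field names from the OpenBMC
Redfish logging convention `[4]`, comma-separated arguments, and the
`OpenBMC.0.1.` prefix seen in the `EventLog` sample of this document; the real
implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: builds the journal fields for one parsed event.
// The field names and the registry prefix are assumptions, see the text above.
std::vector<std::string>
    makeJournalFields(const std::string& eventId,
                      const std::vector<std::string>& parameters)
{
    // Join Parameters into a single comma-separated value
    std::string args;
    for (std::size_t i = 0; i < parameters.size(); ++i)
    {
        if (i != 0)
        {
            args += ',';
        }
        args += parameters[i];
    }
    return {"REDFISH_MESSAGE_ID=OpenBMC.0.1." + eventId,
            "REDFISH_MESSAGE_ARGS=" + args};
}
```

In a real service these strings would be handed to the journal API instead of
being returned to the caller.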

#### Platform Event payload parsing

Each IPMI Platform Event is parsed using the aforementioned `me::messageHook`
handler. The proposed algorithm is implemented as follows:

##### 1. Determine EventType

Based on `EventType`, the designated handler is called.

```cpp
namespace me {
static bool messageHook(const SELData& selData, std::string& eventId,
                        std::vector<std::string>& parameters)
{
    // selData.offset carries the value identifying the health event type
    const HealthEventType healthEventType =
        static_cast<HealthEventType>(selData.offset);

    switch (healthEventType)
    {
        case HealthEventType::FirmwareStatus:
            return fw_status::messageHook(selData, eventId, parameters);

        case HealthEventType::SmbusLinkFailure:
            return smbus_failure::messageHook(selData, eventId, parameters);
    }
    // No handler is registered for this event type
    return false;
}
} // namespace me
```

##### 2. Call designated handler

An example handler for `FirmwareStatus`, reduced to the essential, distinctive
use cases:

```cpp
namespace fw_status {
// Maps EventData[1] to specified message.
// Declared before messageHook so that eventMap below can refer to it.
static const boost::container::flat_map<uint8_t, std::string>
    manufacturingError = {
        {0x00, "Generic error"},
        {0x01, "Wrong or missing VSCC table"}};

static bool messageHook(const SELData& selData, std::string& eventId,
                        std::vector<std::string>& parameters)
{
    // Maps EventData[0] to either a resolution or further action
    static const boost::container::flat_map<
        uint8_t,
        std::pair<std::string, std::optional<std::variant<utils::ParserFunc,
                                                          utils::MessageMap>>>>
        eventMap = {
            // EventData[0]=0
            // > MessageId=MERecoveryGpioForced
            {0x00, {"MERecoveryGpioForced", {}}},

            // EventData[0]=3
            // > call specific handler to determine MessageId and Parameters
            {0x03, {{}, flash_state::messageHook}},

            // EventData[0]=7
            // > MessageId=MEManufacturingError
            // > Use manufacturingError map to translate EventData[1] to string
            //   and add it to Parameters collection
            {0x07, {"MEManufacturingError", manufacturingError}},

            // EventData[0]=9
            // > MessageId=MEFirmwareException
            // > Use a function to log specified byte of payload as Parameter
            //   in chosen format. Here it stores 2nd byte in hex format.
            {0x09, {"MEFirmwareException", utils::logByteHex<2>}}};

    return utils::genericMessageHook(eventMap, selData, eventId, parameters);
}
} // namespace fw_status
```
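The snippet above relies on `utils::genericMessageHook`, which is not shown in
this document. The following is a hedged sketch of how such a resolver could
work, written against the standard library only. `EventBytes`, `genericHook`,
`logByteHex` and `exampleEventMap` are stand-ins invented for this illustration
and are not the `intel-ipmi-oem` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for SELData: only the EventData bytes are kept.
struct EventBytes
{
    std::vector<uint8_t> data;
};

using Parameters = std::vector<std::string>;
using ParserFunc =
    std::function<bool(const EventBytes&, std::string&, Parameters&)>;
using MessageMap = std::map<uint8_t, std::string>;
using Action = std::optional<std::variant<ParserFunc, MessageMap>>;
using EventMap = std::map<uint8_t, std::pair<std::string, Action>>;

// Resolves EventData[0] through the map, then runs the attached action.
bool genericHook(const EventMap& eventMap, const EventBytes& event,
                 std::string& eventId, Parameters& parameters)
{
    if (event.data.empty())
    {
        return false;
    }
    const auto match = eventMap.find(event.data[0]);
    if (match == eventMap.end())
    {
        return false;
    }
    const auto& [id, action] = match->second;
    eventId = id;
    if (!action)
    {
        return true; // the EventId alone is the full interpretation
    }
    if (const auto* parser = std::get_if<ParserFunc>(&*action))
    {
        return (*parser)(event, eventId, parameters); // nested handler
    }
    // Otherwise translate EventData[1] into a string Parameter
    const auto& messages = std::get<MessageMap>(*action);
    if (event.data.size() < 2)
    {
        return false;
    }
    const auto text = messages.find(event.data[1]);
    if (text == messages.end())
    {
        return false;
    }
    parameters.push_back(text->second);
    return true;
}

// Logs the byte at position Index as a hex string Parameter
template <std::size_t Index>
bool logByteHex(const EventBytes& event, std::string&, Parameters& parameters)
{
    if (event.data.size() <= Index)
    {
        return false;
    }
    char buffer[5]; // "0xFF" plus terminating null
    std::snprintf(buffer, sizeof(buffer), "0x%02X",
                  static_cast<unsigned>(event.data[Index]));
    parameters.emplace_back(buffer);
    return true;
}

// Reduced map with one entry per kind of action
EventMap exampleEventMap()
{
    return {{0x00, {"MERecoveryGpioForced", std::nullopt}},
            {0x07,
             {"MEManufacturingError",
              MessageMap{{0x00, "Generic error"},
                         {0x01, "Wrong or missing VSCC table"}}}},
            {0x09, {"MEFirmwareException", ParserFunc{logByteHex<2>}}}};
}
```

The `std::variant` lets one map describe three behaviors: a bare `EventId`, a
nested parser, or a table that turns the next byte into a `Parameter`.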

##### 3. Store parsed log in system

The cascade of function calls, logging utilities and map resolutions populates
both `std::string& eventId` and `std::vector<std::string>& parameters`. This
data is then used to form a valid system log entry, which is stored in the
system journal.

#### Event data listing

Event data is accessible as `Redfish` resources in two places:

- `MessageRegistry` - stores all event 'metadata' (severity, resolution notes,
  messageId)
- `EventLog` - lists all detected events in the system in processed,
  human-readable form

##### MessageRegistry

The `bmcweb` implementation of the
[MessageRegistry](http://redfish.dmtf.org/schemas/v1/MessageRegistry.json)
contents can be found at
`openbmc/bmcweb/redfish-core/include/registries/openbmc_message_registry.hpp`.

**Intel-specific events have a proper prefix in MessageId: either 'BIOS' or
'ME'.**

The user can read it by calling `GET` on the Redfish resource
`/redfish/v1/Registries/OpenBMC/OpenBMC`. It contains a JSON array of entries in
standard Redfish format, like so:

```json
"MEFlashWearOutWarning": {
    "Description": "Indicates that Intel ME has reached certain threshold of flash write operations.",
    "Message": "Warning threshold for number of flash operations has been exceeded. Current percentage of write operations capacity: %1",
    "NumberOfArgs": 1,
    "ParamTypes": [
        "number"
    ],
    "Resolution": "No immediate repair action needed.",
    "Severity": "Warning"
}
```

##### EventLog

The system-wide
[EventLog](http://redfish.dmtf.org/schemas/v1/LogService.json) is implemented in
`bmcweb` at `openbmc/bmcweb/redfish-core/lib/log_services.hpp`.

The user can read it by calling `GET` on the Redfish resource
`/redfish/v1/Systems/system/LogServices/EventLog`. It contains a JSON array of
log entries in standard Redfish format, like so:

```json
{
  "@odata.id": "/redfish/v1/Systems/system/LogServices/EventLog/Entries/37331",
  "@odata.type": "#LogEntry.v1_4_0.LogEntry",
  "Created": "1970-01-01T10:22:11+00:00",
  "EntryType": "Event",
  "Id": "37331",
  "Message": "Warning threshold for number of flash operations has been exceeded. Current percentage of write operations capacity: 50",
  "MessageArgs": ["50"],
  "MessageId": "OpenBMC.0.1.MEFlashWearOutWarning",
  "Name": "System Event Log Entry",
  "Severity": "Warning"
}
```
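The `MessageId` in the entry above appears to follow the Redfish
`Registry.Major.Minor.MessageKey` pattern, where the last field is the key of
the `MessageRegistry` entry. A client could split it as sketched here;
`MessageIdParts` and `splitMessageId` are hypothetical names, and the sketch
assumes exactly four dot-separated fields.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct MessageIdParts
{
    std::string registry; // e.g. "OpenBMC"
    std::string version;  // e.g. "0.1"
    std::string key;      // e.g. "MEFlashWearOutWarning"
};

// Splits "Registry.Major.Minor.MessageKey" into its parts.
// Returns std::nullopt when the input does not have exactly four fields.
std::optional<MessageIdParts> splitMessageId(const std::string& messageId)
{
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true)
    {
        const std::size_t dot = messageId.find('.', start);
        if (dot == std::string::npos)
        {
            fields.push_back(messageId.substr(start));
            break;
        }
        fields.push_back(messageId.substr(start, dot - start));
        start = dot + 1;
    }
    if (fields.size() != 4)
    {
        return std::nullopt;
    }
    return MessageIdParts{fields[0], fields[1] + "." + fields[2], fields[3]};
}
```

The extracted key can then be used to find the matching `Severity` and
`Resolution` in the registry resource described earlier.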

## References

1. [IPMI Specification v2.0](https://www.intel.pl/content/www/pl/pl/products/docs/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html)
2. [DMTF Redfish Schema Guide](https://www.dmtf.org/sites/default/files/standards/documents/DSP2046_2019.3.pdf)
3. [OpenBMC](https://github.com/openbmc)
4. [OpenBMC Redfish Event logging](https://github.com/openbmc/docs/blob/master/architecture/redfish-logging-in-bmcweb.md)
5. [Intel ME External Interfaces Specification](https://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/intel-power-node-manager-v3-spec.pdf)
6. [ipmitool](https://github.com/ipmitool/ipmitool)
7. [OpenBMC Intel IPMI support](https://github.com/openbmc/intel-ipmi-oem)
8. [OpenBMC BMCWeb](https://github.com/openbmc/bmcweb)