# Intel IPMI Platform Events parsing

In many cases, manufacturer-specific IPMI Platform Events are stored in binary
form in the System Event Log, which makes it difficult to understand the
platform state. This document specifies a solution for presenting
manufacturer-specific IPMI Platform Events in a human-readable form by defining
a generic framework for parsing and defining new messages in an easy and
scalable way. Events originating from the Intel Management Engine (ME) are used
as a case study. The general design of the solution is followed by a
tailored-down implementation for OpenBMC, described in detail.

## Glossary

- **IPMI** - Intelligent Platform Management Interface; standardized binary
    protocol for communication between endpoints in a datacenter `[1]`
- **Platform Event** - specific type of IPMI binary payload, used for encoding
    and sending asynchronous one-way messages to a recipient `[1]-29.3`
- **ME** - Intel Management Engine, autonomous subsystem used for remote
    datacenter management `[5]`
- **Redfish** - modern datacenter management protocol, built around the REST
    protocol and the JSON format `[2]`
- **OpenBMC** - open-source BMC implementation with a Redfish-oriented
    interface `[3]`

## Problem statement

IPMI is designed to be a compact and efficient binary format for data exchanged
between entities in a datacenter. The recipient is responsible for receiving the
data and for analyzing, parsing, and translating the binary representation into
a human-readable format. IPMI Platform Events are one type of these messages,
used to inform the recipient about the occurrence of a particular, well-defined
situation.

Some IPMI Platform Events are standardized, described in the specification, and
already have an open-source implementation `[6]`; however, they cover only part
of the spectrum. The increasing complexity of datacenter systems has multiplied
the possible sources of events, which are defined by manufacturer-specific
extensions to platform event data. One of these sources is Intel ME, which is
able to deliver information about its own state of operation and, in some cases,
notify about certain erroneous system-wide conditions, like interface errors.

These OEM-specific messages lack support in existing open-source
implementations. They require a manual, documentation-based `[5]`
implementation, which has historically been a source of many interpretation
errors. Any document update requires manual code modification according to the
specific changes, which is neither efficient nor scalable. Furthermore, the
documentation is not always clear on event severity or possible resolution
actions.

## Solution

A generic, OEM-agnostic algorithm is proposed to produce human-readable output
for a binary IPMI Platform Event.

In general, each event consists of a predefined payload with the following
fields:

- `GeneratorID` - used to determine the source of the event,
- `SensorNumber` - generator-specific unique sensor number,
- `EventType` - sensor-specific group of events,
- `EventData` - array with detailed event data.

Each consecutive event field narrows down the domain of event interpretations,
starting with `GeneratorID` at the top and ending with `EventData` at the
bottom of a `decision tree`. Software should be able to determine the meaning of
the event by using the `divide and conquer` approach on a predefined list of
well-known event definitions. Such a decision tree might also be needed to
break down `EventData` itself, as is the case in many OEM-specific IPMI
implementations.

The implementation should therefore be a series of filters with increasing
specialization at each level. A recursive algorithm for this looks as follows:

```
     +-------------+           +*Step 1*               +
     | +---------+ |           |                       |
     | |Currently| |           |Analyze and choose     |
+----> |analyzed +------------>+proper 'subtree' parser|
|    | |chunk    | |           |                       |
|    | +---------+ |           +                       +         +---------+
|    | +---------+ |                                             |Remainder|
|    | |Remainder| |                                             |         |
|    | |         | |           +*Step 2*               +         |         |
|    | |         | |           |                       |         |         |
|    | |         +---------------------------------------------->+         +---+
|    | |         | |           |'Cut' the remainder    |         |         |   |
|    | |         | |           |and go back to Step 1  |         |         |   |
|    | |         | |           +                       +         |         |   |
|    | +---------+ |                                             |         |   |
|    +-------------+                                             +---------+   |
|                                                                              |
|                                                                              |
+------------------------------------------------------------------------------+
```

The described process is repeated until there is nothing left to break down and
a single unique event interpretation (an `EventId`) is determined.

Not all event data is a decision point: certain chunks of data should be kept
as-is or formatted in a certain way so that they can be included in the
human-readable `Message`. Parser operation should therefore also include logic
for extracting `Parameters` during the traversal process.

Both the `EventId` and the optional collection of `Parameters` are then used as
input to a lookup mechanism that generates the final `Event Message`. Each
message consists of the following entries:

- `EventId` - associated unique event,
- `Severity` - determines how severely this particular event might affect usual
    datacenter operation,
- `Resolution` - suggested steps to mitigate a possible problem,
- `Message` - human-readable message, possibly with predefined placeholders for
    `Parameters`.
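
The traversal described above can be sketched as a small recursive tree walk. This is only an illustration under stated assumptions: the `Node` and `ParsedEvent` types, the `parse` and `exampleTree` functions, and the single "next byte as decimal" parameter rule are all hypothetical and are not taken from OpenBMC.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical result type: the unique event plus extracted parameters.
struct ParsedEvent
{
    std::string eventId;
    std::vector<std::string> parameters;
};

// Hypothetical decision-tree node.
struct Node
{
    std::uint8_t key;           // byte value that selects this node
    std::string eventId;        // set on leaves only
    bool decimalParameter;      // leaf: emit the next byte as a decimal string
    std::vector<Node> children; // empty on leaves
};

// Step 1: use the current byte to choose the matching subtree.
// Step 2: "cut" the remainder (advance pos) and repeat on that subtree.
std::optional<ParsedEvent> parse(const Node& node,
                                 const std::vector<std::uint8_t>& payload,
                                 std::size_t pos = 0)
{
    if (node.children.empty())
    {
        if (node.eventId.empty())
        {
            return std::nullopt; // dead end without an interpretation
        }
        ParsedEvent event{node.eventId, {}};
        if (node.decimalParameter)
        {
            if (pos >= payload.size())
            {
                return std::nullopt; // parameter byte is missing
            }
            event.parameters.push_back(std::to_string(payload[pos]));
        }
        return event;
    }
    if (pos >= payload.size())
    {
        return std::nullopt; // payload ended before reaching a leaf
    }
    for (const Node& child : node.children)
    {
        if (child.key == payload[pos])
        {
            return parse(child, payload, pos + 1);
        }
    }
    return std::nullopt; // unknown value at this level
}

// Tree for one path only:
// GeneratorID 0x2C -> SensorNumber 0x17 -> EventType 0x00 -> EventData[0] 0x0A
Node exampleTree()
{
    Node flashWearout{0x0A, "FlashWearoutWarning", true, {}};
    Node fwStatus{0x00, "", false, {flashWearout}};
    Node meHealth{0x17, "", false, {fwStatus}};
    Node me{0x2C, "", false, {meHealth}};
    return Node{0x00, "", false, {me}}; // root node, its key is unused
}
```

With this sketch, calling `parse(exampleTree(), {0x2C, 0x17, 0x00, 0x0A, 0x32})` walks one level per byte and yields the `EventId` `FlashWearoutWarning` with a single parameter taken from the last byte. A payload that does not match any branch produces no result.
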

### Example

An example of such a message parsing process is shown below:

```
        +-------------+
        |[GeneratorId]|
        |0x2C (ME)    |
        +------+------+
               |
        +------v---------+
        |[SensorNumber]  |
. . . . |0x17 (ME Health)|
        +------+---------+
               |
        +------v---------+
        |[EventType]     |
. . . . |0x00 (FW Status)|
        +------+---------+
               |
        +------v-------------------+
        |[EventData[0]]            |           +-------------------------------------------+
. . . . |0x0A (FlashWearoutWarning)+------+    |ParsedEvent|                               |
        +------+-------------------+      |    +-----------+                               |
               |                          +---->'EventId' = FlashWearoutWarning            |
        +------v----------+               +---->'Parameters' = [ toDecimal(EventData[1]) ] |
        |[EventData[1]]   |               |    |                                           |
        |0x## (Percentage)+---------------+    +-------------------------------------------+
        +-----------------+
```

The determined `ParsedEvent` can then be passed to a lookup mechanism, which
contains human-readable information for each `EventId`:

```
+------------------------------------------------+
|+------------------------------------------------+
||+------------------------------------------------+
||| EventId: FlashWearoutWarning                   |
||| Severity: Warning                              |
||| Resolution: No immediate repair action needed  |
||| Message: Warning threshold for number of flash |
|||          operations has been exceeded. Current |
|||          percentage of write operations        |
+||          capacity: %1                          |
 +|                                                |
  +------------------------------------------------+
```
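
The lookup step can be illustrated with a short sketch that finds the entry for an `EventId` and substitutes `%1`, `%2`, ... with the collected `Parameters`. The `MessageEntry` type and the `fillPlaceholders` and `renderEvent` functions are hypothetical names chosen for this example; it assumes placeholders are numbered from 1, as in the message shown above.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical registry entry, mirroring the fields listed above.
struct MessageEntry
{
    std::string severity;
    std::string resolution;
    std::string message;
};

// Replaces %1..%N with the matching parameter. Indices are processed from
// highest to lowest so that "%1" never matches the prefix of "%10".
std::string fillPlaceholders(std::string message,
                             const std::vector<std::string>& parameters)
{
    for (std::size_t i = parameters.size(); i > 0; --i)
    {
        const std::string placeholder = "%" + std::to_string(i);
        std::size_t pos = 0;
        while ((pos = message.find(placeholder, pos)) != std::string::npos)
        {
            message.replace(pos, placeholder.size(), parameters[i - 1]);
            pos += parameters[i - 1].size(); // skip over the inserted text
        }
    }
    return message;
}

// Looks up the EventId and returns the final human-readable message,
// or no value when the EventId is not present in the registry.
std::optional<std::string>
    renderEvent(const std::map<std::string, MessageEntry>& registry,
                const std::string& eventId,
                const std::vector<std::string>& parameters)
{
    const auto entry = registry.find(eventId);
    if (entry == registry.end())
    {
        return std::nullopt;
    }
    return fillPlaceholders(entry->second.message, parameters);
}
```

For a registry holding the `FlashWearoutWarning` entry from the figure and the parameter list `{"50"}`, `renderEvent` would return the message text ending in `capacity: 50`.
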

## Solution in OpenBMC

The proposed algorithm is delivered as part of the open-source OpenBMC project
`[3]`. As this software stack is built with a micro-service architecture in
mind, the implementation had to be divided into multiple parts:

- IPMI Platform Event payload unpacking (`[7]`)
  - `openbmc/intel-ipmi-oem/src/sensorcommands.cpp`
  - `openbmc/intel-ipmi-oem/src/ipmi_to_redfish_hooks.cpp`
- Intel ME event parsing
  - `openbmc/intel-ipmi-oem/src/me_to_redfish_hooks.cpp`
- Detected events storage (`[4]`)
  - `systemd journal`
- Human-readable message lookup (`[2], [8]`)
  - `MessageRegistry in bmcweb`
    - `openbmc/bmcweb/redfish-core/include/registries/openbmc_message_registry.hpp`

### OpenBMC flow

#### Event arrival

1. The IPMI driver notifies `intel-ipmi-oem` about an incoming `Platform Event`
    (NetFn=0x4, Cmd=0x2)
    - The proper command handler in `intel-ipmi-oem/src/sensorcommands.cpp`
        is notified
2. The message is forwarded to `intel-ipmi-oem/src/ipmi_to_redfish_hooks.cpp`
    as a call to `sel::checkRedfishHooks`
    - `sel::checkRedfishHooks` analyzes the data; `BIOS` events are handled
        in place, while `ME` events are delegated to
        `intel-ipmi-oem/src/me_to_redfish_hooks.cpp`
3. `me::messageHook` is called with the payload. The parsing algorithm
    determines the final `EventId` and `Parameters`
    - `me::utils::storeRedfishEvent(EventId, Parameters)` is called; it stores
        the event securely in the `system journal`

#### Platform Event payload parsing

Each IPMI Platform Event is parsed using the aforementioned `me::messageHook`
handler. The implementation of the proposed algorithm is as follows:

##### 1. Determine EventType

Based on `EventType`, the proper designated handler is called.

```cpp
namespace me {
static bool messageHook(const SELData& selData, std::string& eventId,
                        std::vector<std::string>& parameters)
{
    const HealthEventType healthEventType =
        static_cast<HealthEventType>(selData.offset);

    // Delegate to the handler that knows how to interpret this event group
    switch (healthEventType)
    {
        case HealthEventType::FirmwareStatus:
            return fw_status::messageHook(selData, eventId, parameters);

        case HealthEventType::SmbusLinkFailure:
            return smbus_failure::messageHook(selData, eventId, parameters);
    }
    // Unknown event type: no interpretation available
    return false;
}
} // namespace me
```

##### 2. Call designated handler

Example of a handler for `FirmwareStatus`, trimmed down to the essential,
distinctive use cases:

```cpp
namespace fw_status {
// Maps EventData[1] to specified message
static const boost::container::flat_map<uint8_t, std::string>
    manufacturingError = {
        {0x00, "Generic error"},
        {0x01, "Wrong or missing VSCC table"}};

static bool messageHook(const SELData& selData, std::string& eventId,
                        std::vector<std::string>& parameters)
{
    // Maps EventData[0] to either a resolution or further action
    static const boost::container::flat_map<
        uint8_t,
        std::pair<std::string, std::optional<std::variant<utils::ParserFunc,
                                                          utils::MessageMap>>>>
        eventMap = {
            // EventData[0]=0
            // > MessageId=MERecoveryGpioForced
            {0x00, {"MERecoveryGpioForced", {}}},

            // EventData[0]=3
            // > call specific handler to determine MessageId and Parameters
            {0x03, {{}, flash_state::messageHook}},

            // EventData[0]=7
            // > MessageId=MEManufacturingError
            // > Use manufacturingError map to translate EventData[1] to string
            //   and add it to Parameters collection
            {0x07, {"MEManufacturingError", manufacturingError}},

            // EventData[0]=9
            // > MessageId=MEFirmwareException
            // > Use a function to log specified byte of payload as Parameter
            //   in chosen format. Here it stores the second byte in hex format.
            {0x09, {"MEFirmwareException", utils::logByteHex<2>}}};

    return utils::genericMessageHook(eventMap, selData, eventId, parameters);
}
} // namespace fw_status
```
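
To make the role of the event map more concrete, the following is a hypothetical sketch of how a helper in the spirit of `utils::genericMessageHook` could interpret it. It is not the code from `intel-ipmi-oem`: the simplified `SELData` layout, the use of `std::map` instead of `boost::container::flat_map`, the zero-based index taken by `logByteHex`, and `exampleEventMap` are all assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical, simplified payload: only the three event data bytes.
struct SELData
{
    std::array<std::uint8_t, 3> eventData;
};

using ParserFunc = std::function<bool(const SELData&, std::string&,
                                      std::vector<std::string>&)>;
using MessageMap = std::map<std::uint8_t, std::string>;
using Action = std::optional<std::variant<ParserFunc, MessageMap>>;
using EventMap = std::map<std::uint8_t, std::pair<std::string, Action>>;

// Appends eventData[Index] (zero-based in this sketch) as a hex parameter.
template <std::size_t Index>
bool logByteHex(const SELData& selData, std::string&,
                std::vector<std::string>& parameters)
{
    char buffer[8];
    std::snprintf(buffer, sizeof(buffer), "0x%02X",
                  static_cast<unsigned>(selData.eventData[Index]));
    parameters.push_back(buffer);
    return true;
}

bool genericMessageHook(const EventMap& eventMap, const SELData& selData,
                        std::string& eventId,
                        std::vector<std::string>& parameters)
{
    const auto match = eventMap.find(selData.eventData[0]);
    if (match == eventMap.end())
    {
        return false; // unknown EventData[0]
    }
    const auto& [id, action] = match->second;
    eventId = id;
    if (!action)
    {
        return true; // the EventId alone fully describes the event
    }
    if (const auto* parser = std::get_if<ParserFunc>(&*action))
    {
        return (*parser)(selData, eventId, parameters); // delegate further
    }
    // Otherwise translate EventData[1] to a string parameter.
    const auto& messages = std::get<MessageMap>(*action);
    const auto text = messages.find(selData.eventData[1]);
    if (text == messages.end())
    {
        return false;
    }
    parameters.push_back(text->second);
    return true;
}

// A reduced map covering the three kinds of entries.
EventMap exampleEventMap()
{
    const MessageMap manufacturingError{
        {0x00, "Generic error"}, {0x01, "Wrong or missing VSCC table"}};
    return EventMap{
        {0x00, {"MERecoveryGpioForced", Action{}}},
        {0x07, {"MEManufacturingError", Action(manufacturingError)}},
        {0x09, {"MEFirmwareException", Action(ParserFunc(&logByteHex<1>))}}};
}
```

The three branches correspond to the three kinds of map entries: an `EventId` with no further action, a nested parser function, and a lookup table that turns the next byte into a parameter string.
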

##### 3. Store parsed log in system

The cascading calls of functions, logging utilities, and map resolutions
populate both `std::string& eventId` and
`std::vector<std::string>& parameters`. This data is then used to form a valid
system log entry, which is stored in the system journal.

#### Event data listing

Event data is accessible as `Redfish` resources in two places:

- `MessageRegistry` - stores all event 'metadata'
    (severity, resolution notes, messageId)
- `EventLog` - lists all events detected in the system in processed,
    human-readable form

##### MessageRegistry

The `bmcweb` implementation of the [MessageRegistry](http://redfish.dmtf.org/schemas/v1/MessageRegistry.json)
contents can be found at `openbmc/bmcweb/redfish-core/include/registries/openbmc_message_registry.hpp`.

**Intel-specific events have a proper prefix in MessageId: either 'BIOS' or 'ME'.**

It can be read by the user by calling `GET` on the Redfish resource
`/redfish/v1/Registries/OpenBMC/OpenBMC`. It contains a JSON array of entries
in the standard Redfish format, like so:

```json
"MEFlashWearOutWarning": {
    "Description": "Indicates that Intel ME has reached certain threshold of flash write operations.",
    "Message": "Warning threshold for number of flash operations has been exceeded. Current percentage of write operations capacity: %1",
    "NumberOfArgs": 1,
    "ParamTypes": [
        "number"
    ],
    "Resolution": "No immediate repair action needed.",
    "Severity": "Warning"
}
```

##### EventLog

The system-wide [EventLog](http://redfish.dmtf.org/schemas/v1/LogService.json)
is implemented in `bmcweb` at `openbmc/bmcweb/redfish-core/lib/log_services.hpp`.

It can be read by the user by calling `GET` on the Redfish resource
`/redfish/v1/Systems/system/LogServices/EventLog`. It contains a JSON array
of log entries in the standard Redfish format, like so:

```json
{
    "@odata.id": "/redfish/v1/Systems/system/LogServices/EventLog/Entries/37331",
    "@odata.type": "#LogEntry.v1_4_0.LogEntry",
    "Created": "1970-01-01T10:22:11+00:00",
    "EntryType": "Event",
    "Id": "37331",
    "Message": "Warning threshold for number of flash operations has been exceeded. Current percentage of write operations capacity: 50",
    "MessageArgs": [
        "50"
    ],
    "MessageId": "OpenBMC.0.1.MEFlashWearOutWarning",
    "Name": "System Event Log Entry",
    "Severity": "Warning"
}
```

## References

1. [IPMI Specification v2.0](https://www.intel.pl/content/www/pl/pl/products/docs/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html)
2. [DMTF Redfish Schema Guide](https://www.dmtf.org/sites/default/files/standards/documents/DSP2046_2019.3.pdf)
3. [OpenBMC](https://github.com/openbmc)
4. [OpenBMC Redfish Event logging](https://github.com/openbmc/docs/blob/master/architecture/redfish-logging-in-bmcweb.md)
5. [Intel ME External Interfaces Specification](https://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/intel-power-node-manager-v3-spec.pdf)
6. [ipmitool](https://github.com/ipmitool/ipmitool)
7. [OpenBMC Intel IPMI support](https://github.com/openbmc/intel-ipmi-oem)
8. [OpenBMC BMCWeb](https://github.com/openbmc/bmcweb)