# OpenPower Platform Event Log (PEL) extension

This extension will create PELs for every OpenBMC event log. It is also
possible to pass in a raw PEL with the OpenBMC event log, and then that PEL
will be used instead of creating a new one.

## Contents
* [Passing in data when creating PELs](#passing-pel-related-data-within-an-openbmc-event-log)
* [Default UserData sections for BMC created PELs](#default-userdata-sections-for-bmc-created-pels)
* [The PEL Message Registry](#the-pel-message-registry)
* [Callouts](#callouts)
* [Action Flags and Event Type Rules](#action-flags-and-event-type-rules)
* [D-Bus Interfaces](#d-bus-interfaces)
* [PEL Retention](#pel-retention)
* [Adding python3 modules for PEL UserData and SRC parsing](#adding-python3-modules-for-pel-userdata-and-src-parsing)
* [Fail Boot on Host Errors](#fail-boot-on-host-errors)

## Passing PEL related data within an OpenBMC event log

An error log creator can pass in data that is relevant to a PEL by using
certain keywords in the AdditionalData property of the event log.

### AdditionalData keywords

#### RAWPEL

This keyword is used to point to an existing PEL in a binary file that should
be associated with this event log.  The syntax is:
```
RAWPEL=<path to PEL File>
e.g.
RAWPEL="/tmp/pels/pel.5"
```
The code will assign its own error log ID to this PEL, and also update the
commit timestamp field to the current time.

#### POWER_THERMAL_CRITICAL_FAULT

This keyword is used to set the power fault bit in the PEL. The syntax is:
```
POWER_THERMAL_CRITICAL_FAULT=<FLAG>
e.g.
POWER_THERMAL_CRITICAL_FAULT=TRUE
```

Note that TRUE is the only value supported.

#### SEVERITY_DETAIL

This is used when the passed in event log severity determines the PEL
severity, and a more granular PEL severity is needed than the normal event log
to PEL severity conversion can provide.

The syntax is:
```
SEVERITY_DETAIL=<SEVERITY_TYPE>
e.g.
SEVERITY_DETAIL=SYSTEM_TERM
```
Supported option:
- SYSTEM_TERM: changes the Severity value from 0x50 to 0x51

#### ESEL

This keyword's data contains a full PEL in string format.  This is how hostboot
sends down PELs when it is configured in IPMI communication mode.  The PEL is
handled just like the PEL obtained using the RAWPEL keyword.

The syntax is:

```
ESEL=
"00 00 df 00 00 00 00 20 00 04 12 01 6f aa 00 00 50 48 00 30 01 00 33 00 00..."
```

Note that there are 16 bytes of IPMI SEL data before the PEL data starts.
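
As a minimal sketch of that layout, the hypothetical Python helper below (`esel_to_pel` is an illustrative name, not part of phosphor-logging) converts an ESEL string into raw PEL bytes by dropping the 16-byte IPMI SEL prefix. It assumes the value is whitespace separated hex bytes, as in the example above. Applied to the first 20 bytes of that example, the remaining data starts with `50 48` ("PH"), the ID of the PEL's first section, the Private Header.

```python
# Number of IPMI SEL bytes that precede the PEL data in an ESEL value.
IPMI_SEL_DATA_SIZE = 16


def esel_to_pel(esel: str) -> bytes:
    """Hypothetical helper: strip the IPMI SEL prefix from an ESEL string.

    Assumes the value is whitespace separated hex bytes, optionally wrapped
    in double quotes, as in the ESEL example.
    """
    raw = bytes.fromhex(esel.strip().strip('"'))
    if len(raw) <= IPMI_SEL_DATA_SIZE:
        raise ValueError("ESEL data is too short to contain a PEL")
    return raw[IPMI_SEL_DATA_SIZE:]


# The first 20 bytes of the ESEL example above.
example = '"00 00 df 00 00 00 00 20 00 04 12 01 6f aa 00 00 50 48 00 30"'
pel = esel_to_pel(example)
print(pel[:2])  # b'PH'
```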

#### _PID

This keyword, which contains the application's PID, is added automatically by
the phosphor-logging daemon when the `commit` or `report` APIs are used to
create an event log, but not when the `Create` D-Bus method is used.  If a
caller of the `Create` API wants its PID captured in the PEL, it should pass in
this keyword itself.

This will be added to the PEL in a section of type User Data (UD), along with
the name of the application it corresponds to.

The syntax is:
```
_PID=<PID of application>
e.g.
_PID="12345"
```

#### CALLOUT_INVENTORY_PATH

This is used to pass in an inventory item to use as a callout.  See [here for
details](#passing-callouts-in-with-the-additionaldata-property).

#### CALLOUT_PRIORITY

This can be used along with CALLOUT_INVENTORY_PATH to specify the priority of
that FRU callout. If not specified, the default priority is "H"/High Priority.

The possible values are:
- "H": High Priority
- "M": Medium Priority
- "L": Low Priority

See [here for details](#passing-callouts-in-with-the-additionaldata-property).

#### CALLOUT_DEVICE_PATH with CALLOUT_ERRNO

This is used to pass in a device path to create callouts from.  See [here for
details](#passing-callouts-in-with-the-additionaldata-property).

#### CALLOUT_IIC_BUS with CALLOUT_IIC_ADDR and CALLOUT_ERRNO

This is used to pass in an I2C bus and address to create callouts from.  See
[here for details](#passing-callouts-in-with-the-additionaldata-property).

#### PEL_SUBSYSTEM

This keyword is used to pass in the subsystem that should be associated with
this event log. The syntax is:
```
PEL_SUBSYSTEM=<subsystem value in hex>
e.g.
PEL_SUBSYSTEM=0x20
```

### FFDC Intended For UserData PEL sections

When one needs to add FFDC into the PEL UserData sections, the
`CreateWithFFDCFiles` D-Bus method on the `xyz.openbmc_project.Logging.Create`
interface must be used when creating a new event log. This method takes a list
of files to store in the PEL UserData sections.

That method is the same as `Create`, except that it takes an additional
parameter:

```
std::vector<std::tuple<enum[FFDCFormat],
                       uint8_t,
                       uint8_t,
                       sdbusplus::message::unix_fd>>
```

Each entry in the vector contains a file descriptor for a file that will
be stored in a unique UserData section.  The tuple's arguments are:

- enum[FFDCFormat]: The data format type, the options are:
    - 'JSON'
        - The parser will use nlohmann::json's pretty print
    - 'CBOR'
        - The parser will use nlohmann::json's pretty print
    - 'Text'
        - The parser will output ASCII text
    - 'Custom'
        - The parser will hexdump the data, unless there is a parser registered
          for this component ID and subtype.
- uint8_t: subType
  - Mainly useful for the 'Custom' type (but see [callouts in FFDC
    files](#specifying-multiple-callouts-using-json-format-ffdc-files), which
    use the JSON type with subtype 0xCA).
- uint8_t: version
  - The version of the data.
  - Used for the 'Custom' type.
  - Not planned to be used for the other types unless a reason to do so
    appears.
- unix_fd: The file descriptor for the opened file that contains the
    contents.  The file descriptor can be closed and the file can be deleted if
    desired after the method call.

An example of saving JSON data to a file and getting its file descriptor is:

```
#include <cstdio>
#include <nlohmann/json.hpp>

nlohmann::json json = ...;
auto jsonString = json.dump();
FILE* fp = fopen(filename, "w");
fwrite(jsonString.data(), 1, jsonString.size(), fp);
// Flush the stdio buffer so the data is actually in the file
// before the descriptor is handed off.
fflush(fp);
int fd = fileno(fp);
```

Alternatively, `open()` can be used to obtain the file descriptor of the file.
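
For a Python caller, a rough equivalent might look like the sketch below. The `json_to_ffdc_fd` helper is hypothetical and uses only the standard library; it assumes a Linux BMC, where a file can be unlinked while a descriptor to it stays open. It rewinds the descriptor so that a reader starting at the current offset sees all of the data (whether the PEL code rewinds on its own is not covered here). The D-Bus call that passes the descriptor is not shown.

```python
import json
import os
import tempfile


def json_to_ffdc_fd(data) -> int:
    """Hypothetical helper: write data as JSON to a temp file, return its fd.

    The caller owns the descriptor and should close it after the D-Bus call.
    """
    fd, path = tempfile.mkstemp(suffix=".json")
    # The open descriptor keeps the file contents alive after unlinking.
    os.unlink(path)
    payload = memoryview(json.dumps(data).encode("utf-8"))
    while payload:
        # os.write() may write fewer bytes than requested, so loop.
        written = os.write(fd, payload)
        payload = payload[written:]
    # Rewind so a reader starting at the current offset sees the whole file.
    os.lseek(fd, 0, os.SEEK_SET)
    return fd
```

The returned descriptor would then be used as the `unix_fd` member of the tuple, and closed once the method call returns.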

Upon receiving this data, the PEL code will create UserData sections for each
entry in that vector with the following UserData fields:

- Section header component ID:
    - If the type field from the tuple is "Custom", use the component ID from
      the message registry.
    - Otherwise, set the component ID to the phosphor-logging component ID so
      that the parser knows to use the built-in parsers (e.g. json) for the
      type.
- Section header subtype: The subtype field from the tuple.
- Section header version: The version field from the tuple.
- Section data: The data from the file.

If there is a peltool parser registered for the custom type (method is TBD),
peltool will use it to print the data; otherwise the data will be hexdumped.

Before adding each of these UserData sections, a check will be done to see if
the PEL size will remain under the maximum size of 16KB.  If not, the UserData
section will be truncated just enough so that the PEL fits into the 16KB.
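
The size check can be sketched as simple arithmetic. This is a hypothetical Python illustration, not the phosphor-logging code: it assumes each UserData section carries an 8-byte section header, and it ignores any padding or alignment the real implementation may apply.

```python
MAX_PEL_SIZE = 16 * 1024  # 16KB maximum PEL size
SECTION_HEADER_SIZE = 8   # assumed UserData section header size in bytes


def fit_user_data(current_pel_size: int, data: bytes):
    """Return the part of data that fits in the PEL, or None if none does.

    current_pel_size is the size of the PEL before this section is added.
    """
    available = MAX_PEL_SIZE - current_pel_size - SECTION_HEADER_SIZE
    if available <= 0:
        return None  # not even the section header would fit
    return data[:available]
```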

## Default UserData sections for BMC created PELs

The extension code that creates PELs will add these UserData sections to every
PEL:

- The AdditionalData property contents
  - If the AdditionalData property in the OpenBMC event log has anything in it,
    it will be saved in a UserData section as a JSON string.

- System information
  - This section contains various pieces of system information, such as the
    full code level and the BMC, chassis, and host state properties.

## The PEL Message Registry

The PEL message registry is used to create PELs from OpenBMC event logs.
Documentation can be found [here](registry/README.md).

## Callouts

A callout points to a FRU, a symbolic FRU, or an isolation procedure.  Each PEL
can contain from zero to ten of them, and they are located in the SRC section.

There are a few different ways to add callouts to a PEL.  In all cases, the
callouts will be sorted from highest to lowest priority within the PEL after
they are added.
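
As an illustration, a stable sort over the callout priorities produces that ordering. The Python sketch below is hypothetical, and the ranking it uses (`H`, then `M`, `A`, `B`, `C`, then `L`) is an assumption based on the priority letters listed for JSON callouts in this document. Because Python's `sorted` is stable, callouts with equal priority keep the order in which they were added.

```python
# Assumed priority ranking, highest first (lower rank sorts first).
PRIORITY_RANK = {"H": 0, "M": 1, "A": 2, "B": 3, "C": 4, "L": 5}


def sort_callouts(callouts):
    """Return callouts ordered from highest to lowest priority.

    Each callout is a dict with a "Priority" key holding one of the letters
    in PRIORITY_RANK. The sort is stable, so ties keep their original order.
    """
    return sorted(callouts,
                  key=lambda callout: PRIORITY_RANK[callout["Priority"]])
```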

### Passing callouts in with the AdditionalData property

The PEL code can add callouts based on the values of special entries in the
AdditionalData event log property.  They are:

- CALLOUT_INVENTORY_PATH

    This keyword is used to call out a single FRU by passing in its D-Bus
    inventory path.  When the PEL code sees this, it will create a single FRU
    callout, using the VPD properties (location code, FN, CCIN) from that
    inventory item.  If that item is not a FRU itself and does not have a
    location code, the code will keep searching its parents until it finds one
    that is.

    The priority of the FRU callout will be high, unless the CALLOUT_PRIORITY
    keyword is also present and contains a different priority, in which case
    that priority will be used instead.  This can be useful when a maintenance
    procedure with a high priority callout is specified for this error in the
    message registry and the FRU callout needs to have a different priority.

    ```
    CALLOUT_INVENTORY_PATH=
    "/xyz/openbmc_project/inventory/system/chassis/motherboard"
    ```

- CALLOUT_DEVICE_PATH with CALLOUT_ERRNO

    These keywords are required as a pair to indicate faulty device
    communication, usually detected by a failure accessing a device at that
    sysfs path.  The PEL code will use a data table generated by the MRW to map
    these device paths to FRU callout lists.  The errno value may influence the
    callout.

    I2C, FSI, FSI-I2C, and FSI-SPI paths are supported.

    ```
    CALLOUT_DEVICE_PATH="/sys/bus/i2c/devices/3-0069"
    CALLOUT_ERRNO="2"
    ```

- CALLOUT_IIC_BUS with CALLOUT_IIC_ADDR and CALLOUT_ERRNO

    These 3 keywords can be used to call out a failing I2C device path when the
    full device path isn't known.  It is similar to CALLOUT_DEVICE_PATH in that
    it will use data tables generated by the MRW to create the callouts.

    CALLOUT_IIC_BUS is in the form "/dev/i2c-X" where X is the bus number, or
    just the bus number by itself.
    CALLOUT_IIC_ADDR is the 7-bit address, either as a decimal number or as a
    hex number if preceded with "0x".

    ```
    CALLOUT_IIC_BUS="/dev/i2c-7"
    CALLOUT_IIC_ADDR="81"
    CALLOUT_ERRNO="62"
    ```
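
A minimal Python sketch of how those two value formats could be parsed follows. `parse_iic_bus` and `parse_iic_addr` are hypothetical helpers for illustration, not functions from phosphor-logging. Note that the example's decimal address 81 is the same device as `0x51`.

```python
def parse_iic_bus(value: str) -> int:
    """Accept "/dev/i2c-X" or a bare bus number and return X."""
    value = value.strip()
    prefix = "/dev/i2c-"
    if value.startswith(prefix):
        value = value[len(prefix):]
    return int(value, 10)


def parse_iic_addr(value: str) -> int:
    """Accept a decimal 7-bit address, or a hex one with a "0x" prefix."""
    value = value.strip()
    if value.lower().startswith("0x"):
        address = int(value, 16)  # int() accepts the 0x prefix for base 16
    else:
        address = int(value, 10)
    if not 0 <= address <= 0x7F:
        raise ValueError(f"{value} is not a 7-bit I2C address")
    return address
```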

### Defining callouts in the message registry

Callouts can be completely defined inside that error's definition in the PEL
message registry.  This method allows the callouts to vary based on the system
type or on any AdditionalData item.

At a high level, this involves defining a callout section inside the registry
entry that contains the location codes or procedure names to use, along with
their priority.  If these can vary based on system type, the type provided by
the entity manager will be one of the keys.  If they can also depend on an
AdditionalData entry, then that will also be a key.

See the message registry [README](registry/README.md) and
[schema](registry/schema/schema.json) for the details.

### Using the message registry along with CALLOUT_ entries

If the message registry entry contains a callout definition and the event log
also contains one of the aforementioned CALLOUT keys in the AdditionalData
property, then the PEL code will first add the callouts stemming from the
CALLOUT items, followed by the callouts from the message registry.

### Specifying multiple callouts using JSON format FFDC files

Multiple callouts can be passed in by the creator at the time of PEL creation.
This is done by specifying them in a JSON file that is then passed in as an
[FFDC file](#ffdc-intended-for-userdata-pel-sections).  The JSON will still be
added into a PEL UserData section for debug.

To specify that an FFDC file contains callouts, the format value for that FFDC
entry must be set to JSON, and the subtype field must be set to 0xCA:

```
using FFDC = std::tuple<CreateIface::FFDCFormat,
                        uint8_t,
                        uint8_t,
                        sdbusplus::message::unix_fd>;

FFDC ffdc{
    CreateIface::FFDCFormat::JSON,
    0xCA, // Callout subtype
    0x01, // Callout version, set to 0x01
    fd};
```

The JSON contains an array of callouts that must be in order of highest
priority to lowest, with a maximum of 10.  Any callouts after the 10th will
just be thrown away, as there is no room for them in the PEL. The format looks
like:

```
[
    {
        // First callout
    },
    {
        // Second callout
    },
    {
        // Nth callout
    }
]
```

A callout entry can be a normal hardware callout, a maintenance procedure
callout, or a symbolic FRU callout.  Each callout must contain a Priority
field, where the possible values are:

* "H" = High
* "M" = Medium
* "A" = Medium Group A
* "B" = Medium Group B
* "C" = Medium Group C
* "L" = Low

Either unexpanded location codes or D-Bus inventory object paths can be used to
specify the called out part.  An unexpanded location code does not have the
system VPD information embedded in it, and the 'Ufcs-' prefix is optional (so
it can be either Ufcs-P1 or just P1).
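
A creator could check a callout list against these rules before writing it to the FFDC file. The Python sketch below is hypothetical: `callouts_to_json` is an illustrative name, it only checks the Priority values, the presence of one of the identifying fields used by the callout types described in this section, and the 10-callout limit, and it assumes the list is already ordered from highest to lowest priority.

```python
import json

VALID_PRIORITIES = {"H", "M", "A", "B", "C", "L"}
MAX_CALLOUTS = 10
# Fields that identify what is being called out, per the callout types here.
IDENTIFYING_FIELDS = ("LocationCode", "InventoryPath", "Procedure",
                      "SymbolicFRU")


def callouts_to_json(callouts) -> str:
    """Validate a highest-priority-first callout list and serialize it."""
    for callout in callouts:
        if callout.get("Priority") not in VALID_PRIORITIES:
            raise ValueError(f"Invalid or missing Priority in {callout}")
        if not any(field in callout for field in IDENTIFYING_FIELDS):
            raise ValueError(f"Nothing is called out in {callout}")
    # Anything past the 10th callout would be thrown away by the PEL code,
    # so drop it here to make that explicit.
    return json.dumps(list(callouts)[:MAX_CALLOUTS])
```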

#### Normal hardware FRU callout

Normal hardware callouts must contain either the location code or inventory
path, and priority.  Even though the PEL code doesn't do any guarding or
deconfiguring itself, it needs to know if either of those things occurred, as
there are status bits in the PEL to reflect them.  The Guarded and Deconfigured
fields are used for this.  Those fields are optional, and if omitted their
values will be false.

When the inventory path of a sub-FRU is passed in, the PEL code will put the
location code of the parent FRU into the callout.

```
{
    "LocationCode": "P0-C1",
    "Priority": "H"
}

{
    "InventoryPath": "/xyz/openbmc_project/inventory/motherboard/cpu0/core5",
    "Priority": "H",
    "Deconfigured": true,
    "Guarded": true
}
```

MRUs (Manufacturing Replaceable Units) are 4-byte numbers that can optionally
be added to callouts to specify failing devices on a FRU.  These may be used
during the manufacturing test process, where there may be the ability to do
these replacements.  There can be up to 15 MRUs, each with its own priority,
embedded in a callout.  The possible priority values match the FRU priority
values.

Note that since JSON only supports numbers in decimal and not in hex, MRU IDs
will show up as decimal when visually inspecting the JSON.

```
{
    "LocationCode": "P0-C1",
    "Priority": "H",
    "MRUs": [
        {
            "ID": 1234,
            "Priority": "H"
        },
        {
            "ID": 5678,
            "Priority": "H"
        }
    ]
}
```
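
To see the decimal effect, here is a hypothetical Python helper that builds a callout with MRUs: IDs written as hex in the source come out as decimal once serialized. `make_mru_callout` is illustrative only, and its checks mirror the limits stated above (up to 15 MRUs, 4-byte IDs).

```python
import json

MAX_MRUS = 15


def make_mru_callout(location_code: str, priority: str, mrus) -> dict:
    """Build a hardware callout; mrus is a list of (id, priority) pairs."""
    if len(mrus) > MAX_MRUS:
        raise ValueError(f"At most {MAX_MRUS} MRUs are allowed")
    for mru_id, _ in mrus:
        if not 0 <= mru_id <= 0xFFFFFFFF:
            raise ValueError(f"MRU ID {mru_id:#x} does not fit in 4 bytes")
    return {
        "LocationCode": location_code,
        "Priority": priority,
        "MRUs": [{"ID": mru_id, "Priority": p} for mru_id, p in mrus],
    }


callout = make_mru_callout("P0-C1", "H", [(0x1234, "H"), (0x5678, "M")])
print(json.dumps(callout))
# The IDs serialize as 4660 and 22136, not 0x1234 and 0x5678.
```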

#### Maintenance procedure callout

The LocationCode field is not used with procedure callouts.  Only the first 7
characters of the Procedure field will be used by the PEL.

```
{
    "Procedure": "PRONAME",
    "Priority": "H"
}
```

#### Symbolic FRU callout

Only the first 7 characters of the SymbolicFRU field will be used by the PEL.

If the TrustedLocationCode field is present and set to true, this means the
location code may be used to turn on service indicators, so the LocationCode
field is required.  If TrustedLocationCode is false or missing, then the
LocationCode field is optional.

```
{
    "TrustedLocationCode": true,
    "LocationCode": "P0-C1",
    "Priority": "H",
    "SymbolicFRU": "FRUNAME"
}
```

## `Action Flags` and `Event Type` Rules

The `Action Flags` and `Event Type` PEL fields are optional in the message
registry, and if not present the code will set them based on certain rules
laid out in the PEL spec.

These rules are:

1. Always set the `Report` flag, unless the `Do Not Report` flag is already on.
2. Always clear the `SP Call Home` flag, as that feature isn't supported.
3. If the severity is `Non-error Event`:
    - Clear the `Service Action` flag.
    - Clear the `Call Home` flag.
    - If the `Event Type` field is `Not Applicable`, change it to `Information
      Only`.
    - If the `Event Type` field is `Information Only` or `Tracing`, set the
      `Hidden` flag.
4. If the severity is `Recovered`:
    - Set the `Hidden` flag.
    - Clear the `Service Action` flag.
    - Clear the `Call Home` flag.
5. For all other severities:
    - Clear the `Hidden` flag.
    - Set the `Service Action` flag.
    - Set the `Call Home` flag.

Additional rules may be added in the future if necessary.
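
The rules above can be sketched as a single Python function. This is an illustration rather than the phosphor-logging implementation: flags are modeled as a set of the flag names used above (the real PEL field is a bit mask), severities and event types are plain strings, and the rules are applied in the listed order, so an `Event Type` changed to `Information Only` by rule 3 also gets the `Hidden` flag.

```python
def apply_action_flag_rules(severity: str, event_type: str, flags: set):
    """Return (event_type, flags) after applying rules 1-5 in order."""
    flags = set(flags)  # don't modify the caller's set

    # Rule 1: report unless told not to.
    if "Do Not Report" not in flags:
        flags.add("Report")
    # Rule 2: SP Call Home isn't supported.
    flags.discard("SP Call Home")

    if severity == "Non-error Event":  # Rule 3
        flags -= {"Service Action", "Call Home"}
        if event_type == "Not Applicable":
            event_type = "Information Only"
        if event_type in ("Information Only", "Tracing"):
            flags.add("Hidden")
    elif severity == "Recovered":  # Rule 4
        flags.add("Hidden")
        flags -= {"Service Action", "Call Home"}
    else:  # Rule 5: all other severities
        flags.discard("Hidden")
        flags |= {"Service Action", "Call Home"}

    return event_type, flags
```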

## D-Bus Interfaces

See the `org.open_power.Logging.PEL` interface definition for the most
up-to-date information.

## PEL Retention

The PEL repository is allocated a set amount of space on the BMC.  When that
space gets close to being full, the code will remove a percentage of PELs to
make room for new ones.  In addition, the code will keep a cap on the total
number of PELs allowed.  Note that removing a PEL will also remove the
corresponding OpenBMC event log.

The disk capacity limit is set to 20MB, and the number limit is 3000.

The rules used to remove logs are listed below.  The checks will be run after a
PEL has been added and the method to create the PEL has returned to the caller,
i.e. when control gets back to the event loop.

### Removal Algorithm

If the space used is at or under 95% of the allocated space and the number of
PELs is under the limit, nothing further needs to be done.  Otherwise, run all
5 of the following steps.  Each step only deletes PELs until it meets its
requirement, and then it stops.

The steps are:

1. Remove BMC created informational PELs until they take up 15% or less of the
   allocated space.

2. Remove BMC created non-informational PELs until they take up 30% or less of
   the allocated space.

3. Remove non-BMC created informational PELs until they take up 15% or less of
   the allocated space.

4. Remove non-BMC created non-informational PELs until they take up 30% or less
   of the allocated space.

5. After the previous 4 steps are complete, if there are still more than the
   maximum number of PELs, remove PELs down to 80% of the maximum.

PELs with associated guard records will never be deleted.  Each step above
makes the following 4 passes, stopping as soon as its limit is reached:

Pass 1. Remove HMC acknowledged PELs.<br>
Pass 2. Remove OS acknowledged PELs.<br>
Pass 3. Remove PHYP acknowledged PELs.<br>
Pass 4. Remove all PELs.

After all these steps, disk usage will be at most 90% (15% + 30% + 15% + 30%)
of the allocated space, not counting any PELs that were kept because they have
guard records.
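
The algorithm can be sketched in Python as follows. Everything here is a hypothetical model for illustration: the `Pel` record and its fields are invented, PELs are assumed to be removed oldest first within each pass (the text above does not state an order), and sizes are plain byte counts.

```python
from dataclasses import dataclass


@dataclass
class Pel:
    """Hypothetical model of a stored PEL; the list order is oldest first."""
    id: int
    size: int
    bmc_created: bool
    informational: bool
    hmc_acked: bool = False
    os_acked: bool = False
    phyp_acked: bool = False
    has_guard: bool = False


# Passes 1-4: HMC acked, OS acked, PHYP acked, then any PEL.
PASSES = (
    lambda p: p.hmc_acked,
    lambda p: p.os_acked,
    lambda p: p.phyp_acked,
    lambda p: True,
)

# Steps 1-4: (which PELs, fraction of the allocated space they may use).
SIZE_STEPS = (
    (lambda p: p.bmc_created and p.informational, 0.15),
    (lambda p: p.bmc_created and not p.informational, 0.30),
    (lambda p: not p.bmc_created and p.informational, 0.15),
    (lambda p: not p.bmc_created and not p.informational, 0.30),
)


def prune(pels, max_size=20 * 1024 * 1024, max_count=3000):
    """Return the PELs that remain after running the removal algorithm."""
    pels = list(pels)

    def size_of(group):
        return sum(p.size for p in pels if group(p))

    if size_of(lambda p: True) <= 0.95 * max_size and len(pels) < max_count:
        return pels  # nothing to do

    def remove(group, done):
        # Each pass walks the PELs oldest first and stops once done() is true.
        for in_pass in PASSES:
            for pel in list(pels):
                if done():
                    return
                if group(pel) and in_pass(pel) and not pel.has_guard:
                    pels.remove(pel)

    for group, fraction in SIZE_STEPS:  # Steps 1-4
        remove(group, lambda: size_of(group) <= fraction * max_size)

    if len(pels) > max_count:  # Step 5
        target = int(0.8 * max_count)
        remove(lambda p: True, lambda: len(pels) <= target)

    return pels
```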

## Adding python3 modules for PEL UserData and SRC parsing

In order to support python3 modules for parsing PEL User Data sections and
decoding SRC data, setuptools is used to package python3 modules from external
repos so that they can be included in the OpenBMC image.

Sample layout for setuptools:

```
setup.py
src/usr/scom/plugins/ebmc/b0300.py
src/usr/i2c/plugins/ebmc/b0700.py
src/build/tools/ebmc/errludP_Helpers.py
```

`setup.py` is the build script for setuptools. It contains information about the
package (such as the name and version) as well as which code files to include.

The setup.py template to be used for eBMC User Data parsers:
```
import os.path
from setuptools import setup

# Update this dict with a new key/value pair for every component added
# Key: The package name to be installed as
# Value: The path containing the package's python modules
dirmap = {
    "b0300": "src/usr/scom/plugins/ebmc",
    "b0700": "src/usr/i2c/plugins/ebmc",
    "helpers": "src/build/tools/ebmc"
}

# All packages will be installed under 'udparsers' namespace
def get_package_name(dirmap_key):
    return "udparsers.{}".format(dirmap_key)

def get_package_dirent(dirmap_item):
    package_name = get_package_name(dirmap_item[0])
    package_dir = dirmap_item[1]
    return (package_name, package_dir)

def get_packages():
    return map(get_package_name, dirmap.keys())

def get_package_dirs():
    return map(get_package_dirent, dirmap.items())

setup(
        name="Hostboot",
        version="0.1",
        packages=list(get_packages()),
        package_dir=dict(get_package_dirs())
)
```
- User Data parser module
  - Module name: `xzzzz.py`, where `x` is the Creator Subsystem from the
    Private Header section (in ASCII) and `zzzz` is the 2 byte Component ID
    from the User Data section itself (in HEX). All should be converted to
    lowercase.
    - For example: `b0100.py` for Hostboot created UserData with CompID 0x0100
  - Function to provide: `parseUDToJson`
    - Argument list:
      1. (int) Sub-section type
      2. (int) Section version
      3. (memoryview) Data
    - Return data:
      1. (str) JSON string

  - Sample User Data parser module:
    ```
    import json
    def parseUDToJson(subType, ver, data):
        d = dict()
        ...
        # Parse and populate data into dictionary
        ...
        jsonStr = json.dumps(d)
        return jsonStr
    ```
- SRC parser module
  - Module name: `xsrc.py`, where `x` is the Creator Subsystem from the
    Private Header section (in ASCII, converted to lowercase).
    - For example: `bsrc.py` for Hostboot generated SRCs
  - Function to provide: `parseSRCToJson`
    - Argument list:
      1. (str) Refcode ASCII string
      2. (str) Hexword 2
      3. (str) Hexword 3
      4. (str) Hexword 4
      5. (str) Hexword 5
      6. (str) Hexword 6
      7. (str) Hexword 7
      8. (str) Hexword 8
      9. (str) Hexword 9
    - Return data:
      1. (str) JSON string

  - Sample SRC parser module:
    ```
    import json
    def parseSRCToJson(ascii_str, word2, word3, word4, word5, word6, word7,
                       word8, word9):
        d = dict({'A': 1, 'B': 2})
        ...
        # Decode SRC data into dictionary
        ...
        jsonStr = json.dumps(d)
        return jsonStr
    ```
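
The module naming rules can be expressed as a couple of small helpers. These are hypothetical Python functions for illustration (they are not part of peltool) that only build the module names described above.

```python
def ud_parser_module_name(creator_subsystem: str, component_id: int) -> str:
    """Return the UserData parser module name, e.g. ("B", 0x0100) -> "b0100"."""
    if len(creator_subsystem) != 1:
        raise ValueError("The Creator Subsystem is a single ASCII character")
    if not 0 <= component_id <= 0xFFFF:
        raise ValueError("The Component ID is a 2 byte value")
    return f"{creator_subsystem}{component_id:04x}".lower()


def src_parser_module_name(creator_subsystem: str) -> str:
    """Return the SRC parser module name, e.g. "B" -> "bsrc"."""
    if len(creator_subsystem) != 1:
        raise ValueError("The Creator Subsystem is a single ASCII character")
    return f"{creator_subsystem}src".lower()
```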

## Fail Boot on Host Errors

The fail boot on hw error [design][1] provides a function where a system owner
can tell the firmware to fail the boot of a system if a BMC phosphor-logging
event has a hardware callout in it.

When this fail boot on hardware error setting is enabled, the BMC is required
to fail the boot for **any** error from the host that satisfies the following
criteria:

- Its severity is not `SeverityType::nonError`.
- It has a callout of any kind from the `FailingComponentType` structure.
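
Those criteria amount to a simple predicate. The Python sketch below is hypothetical: it models the severity as the string `"nonError"` and the callouts as a plain list, and it only shows how the conditions combine, not how the BMC implements the check.

```python
def should_fail_boot(setting_enabled: bool, from_host: bool,
                     severity: str, callouts: list) -> bool:
    """True if a PEL should cause the boot to fail under this setting."""
    return (setting_enabled
            and from_host
            and severity != "nonError"
            and len(callouts) > 0)  # a callout of any kind counts
```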

## Self Boot Engine (SBE) First Failure Data Capture (FFDC) Support

When an SBE chip-op fails, the SBE creates FFDC in a custom data format. The
SBE FFDC contains different packets, which include trace and user data for SBE
internal failures as well as hardware procedure failure FFDC created by the
FAPI infrastructure. The PEL infrastructure processes the SBE FFDC packets that
the FAPI infrastructure created during hardware procedure execution failures,
and adds callouts and user data section information based on the FAPI
processing. For non-FAPI based failures, it just keeps the raw FFDC data in the
user data section to support SBE parser plugins.

The `CreatePELWithFFDCFiles` D-Bus method on the `org.open_power.Logging.PEL`
interface must be used when creating a new event log.

To specify that an FFDC file contains SBE FFDC, the format value for that FFDC
entry must be set to "Custom", and the subtype field must be set to 0xCB:

```
using FFDC = std::tuple<CreateIface::FFDCFormat,
                        uint8_t,
                        uint8_t,
                        sdbusplus::message::unix_fd>;

FFDC ffdc{
    CreateIface::FFDCFormat::Custom,
    0xCB, // SBE FFDC subtype
    0x01, // SBE FFDC version, set to 0x01
    fd};
```

The "SRC6" keyword in the AdditionalData property should be populated with the
following fields:

- [0:15] chip position  (hex)
- [16:23] command class (hex)
- [24:31] command       (hex)

For example, for GetSCOM:

```
SRC6="0002A201"
```

Note: The "phal" build-time configure option should be "enabled" to enable this
feature.
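
Packing those fields is plain bit shifting, with bit 0 as the most significant bit of the 32-bit word. The hypothetical Python helpers below (the names are illustrative only) reproduce the GetSCOM example, where chip position 0x0002, command class 0xA2, and command 0x01 give `0002A201`.

```python
def make_src6(chip_position: int, command_class: int, command: int) -> str:
    """Pack the SRC6 fields into an 8 digit hex string."""
    if not 0 <= chip_position <= 0xFFFF:
        raise ValueError("chip position must fit in bits 0-15")
    if not 0 <= command_class <= 0xFF or not 0 <= command <= 0xFF:
        raise ValueError("command class and command must each fit in a byte")
    word = (chip_position << 16) | (command_class << 8) | command
    return f"{word:08X}"


def parse_src6(value: str):
    """Split an SRC6 hex string into (chip_position, command_class, command)."""
    word = int(value.strip('"'), 16)
    return word >> 16, (word >> 8) & 0xFF, word & 0xFF
```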

## PEL Archiving

When an OpenBMC event log is deleted, its corresponding PEL is moved to an
archive folder. These archived PELs will be available in the BMC dump.
The archive path is `/var/lib/phosphor-logging/extensions/pels/logs/archive`.

Key points:
- PELs whose corresponding event logs have been deleted will be available
  in the archive folder.
- The archive folder size is tracked along with the logs folder size, and if
  the combined size exceeds the warning size, all archived PELs will be deleted.
- Archived PELs can be viewed using peltool with the `--archive` flag.
- If a PEL is deleted using peltool, it is not archived.

[1]: https://github.com/openbmc/docs/blob/master/designs/fail-boot-on-hw-error.md