# Error and Event Logging

Author: [Patrick Williams][patrick-email] `<stwcx>`

[patrick-email]: mailto:patrick@stwcx.xyz

Other contributors:

Created: May 16, 2024

## Problem Description

There is currently no consistent end-to-end error and event reporting design
for the OpenBMC code stack. There are two different implementations, one
primarily using phosphor-logging and one using rsyslog, both of which have gaps
that a complete solution should address. This proposal is intended to be an
end-to-end design handling both errors and tracing events, facilitating
external management of the system in an automated and maintainable manner.

## Background and References

### Redfish LogEntry and Message Registry

In Redfish, the [`LogEntry` schema][LogEntry] is used for a range of items that
could be considered "logs", but one such use within OpenBMC is for an equivalent
of the IPMI "System Event Log (SEL)".

The IPMI SEL is the location where the BMC can collect errors and events,
sometimes coming from other entities, such as the BIOS. Examples of these might
be "DIMM-A0 encountered an uncorrectable ECC error" or "System boot successful".
These SEL records are exposed as human readable strings, either natively by an
OEM SEL design or by tools such as `ipmitool`. The strings are typically unique
to each system or manufacturer and could hypothetically change with a BMC or
firmware update, which makes it difficult to create automated tooling around
them. Two different vendors might use different strings to represent a critical
temperature threshold exceeded: ["temperature threshold exceeded"][HPE-Example]
and ["Temperature #0x30 Upper Critical going high"][Oracle-Example]. There is
also no mechanism with IPMI to ask the machine "what are all of the SELs you
might create".

In order to solve two aspects of this problem, listing of possible events and
versioning, Redfish has Message Registries. A message registry is a versioned
collection of all of the error events that a system could generate and hints as
to how they might be parsed and displayed to a user. An [informative
reference][Registry-Example] from the DMTF gives this example:

```json
{
  "@odata.type": "#MessageRegistry.v1_0_0.MessageRegistry",
  "Id": "Alert.1.0.0",
  "RegistryPrefix": "Alert",
  "RegistryVersion": "1.0.0",
  "Messages": {
    "LanDisconnect": {
      "Description": "A LAN Disconnect on %1 was detected on system %2.",
      "Message": "A LAN Disconnect on %1 was detected on system %2.",
      "Severity": "Warning",
      "NumberOfArgs": 2,
      "Resolution": "None"
    }
  }
}
```

This example defines an event, `Alert.1.0.LanDisconnect`, which can record the
disconnect state of a network device and contains placeholders for the affected
device and system. When this event occurs, there might be a `LogEntry` recorded
containing something like:

```json
{
  "Message": "A LAN Disconnect on EthernetInterface 1 was detected on system /redfish/v1/Systems/1.",
  "MessageId": "Alert.1.0.LanDisconnect",
  "MessageArgs": ["EthernetInterface 1", "/redfish/v1/Systems/1"]
}
```

The `Message` contains a human readable string which was created by applying the
`MessageArgs` to the placeholders from the `Message` field in the registry.
System management software can rely on the message registry (referenced from the
`MessageId` field in the `LogEntry`) and `MessageArgs` to avoid needing to
perform string processing for reacting to the event.

Within OpenBMC, there is currently a [limited design][existing-design] for this
Redfish feature and it requires inserting specially formed Redfish-specific
logging messages into any application that wants to record these events, tightly
coupling all applications to the Redfish implementation. It has also been
observed that these [strings][app-example], when used, are often out of date
with the [message registry][obmc-registry-example] advertised by `bmcweb`. Some
maintainers have rejected adding new Redfish-specific logging messages to their
applications.

[LogEntry]:
  https://github.com/openbmc/bmcweb/blob/de0c960c4262169ea92a4b852dd5ebbe3810bf00/redfish-core/schema/dmtf/json-schema/LogEntry.v1_16_0.json
[HPE-Example]:
  https://support.hpe.com/hpesc/public/docDisplay?docId=sd00002092en_us&docLocale=en_US&page=GUID-D7147C7F-2016-0901-06CE-000000000422.html
[Oracle-Example]:
  https://docs.oracle.com/cd/E19464-01/820-6850-11/IPMItool.html#50602039_63068
[Registry-Example]:
  https://www.dmtf.org/sites/default/files/Redfish%20School%20-%20Events_0.pdf
[existing-design]:
  https://github.com/openbmc/docs/blob/master/architecture/redfish-logging-in-bmcweb.md
[app-example]:
  https://github.com/openbmc/phosphor-post-code-manager/blob/f2da78deb3a105c7270f74d9d747c77f0feaae2c/src/post_code.cpp#L143
[obmc-registry-example]:
  https://github.com/openbmc/bmcweb/blob/4ba5be51e3fcbeed49a6a312b4e6b2f1ea7447ba/redfish-core/include/registries/openbmc.json#L5
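
As a minimal sketch of the placeholder substitution described in this section,
the following C++ applies `MessageArgs` to a registry `Message` template. The
function name `formatMessage` is hypothetical and is not taken from `bmcweb` or
any Redfish library. Leaving an unresolvable placeholder untouched is an
assumption made for the sketch, not specified behavior.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: expand "%1", "%2", ... in a registry message using the
// 1-based positions of `args`. A '%' that is not followed by the index of an
// available argument is copied through unchanged.
std::string formatMessage(std::string_view tmpl,
                          const std::vector<std::string>& args)
{
    std::string out;
    for (std::size_t i = 0; i < tmpl.size(); ++i)
    {
        if (tmpl[i] != '%')
        {
            out += tmpl[i];
            continue;
        }

        // Collect the decimal digits that follow the '%'.
        std::size_t j = i + 1;
        std::size_t index = 0;
        while (j < tmpl.size() && tmpl[j] >= '0' && tmpl[j] <= '9')
        {
            index = index * 10 + static_cast<std::size_t>(tmpl[j] - '0');
            ++j;
        }

        if (j > i + 1 && index >= 1 && index <= args.size())
        {
            out += args[index - 1];
            i = j - 1; // skip the digits that were consumed
        }
        else
        {
            out += '%'; // keep the character literally
        }
    }
    return out;
}
```

Applied to the `LanDisconnect` template with the two `MessageArgs` values from
the `LogEntry` example, this produces the human readable `Message` for that
entry, which is why management software only needs `MessageId` and
`MessageArgs`.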

### Existing phosphor-logging implementation

**Note**: While the word 'exception' is used in this section, the existing (and
proposed) types can be used by applications and execution contexts with
exceptions disabled. They are 'exceptions' because they do inherit from
`std::exception` and there is support in the `sdbusplus` bindings for them to be
used in exception handling.

The `sdbusplus` bindings have the capability to define new C++ exception types
which can be thrown by a DBus server and turned into an error response to the
client. `phosphor-logging` extended this to also add metadata associated with
the log type. See the following example error definitions and usages.

`sdbusplus` error binding definition (in
`xyz/openbmc_project/Certs.errors.yaml`):

```yaml
- name: InvalidCertificate
  description: Invalid certificate file.
```

`phosphor-logging` metadata definition (in
`xyz/openbmc_project/Certs.metadata.yaml`):

```yaml
- name: InvalidCertificate
  meta:
    - str: "REASON=%s"
      type: string
```

Application code reporting an error:

```cpp
elog<InvalidCertificate>(Reason("Invalid certificate file format"));
// or
report<InvalidCertificate>(Reason("Existing certificate file is corrupted"));
```

In this sample, an error named
`xyz.openbmc_project.Certs.Error.InvalidCertificate` has been defined, which can
be sent between applications as a DBus response. The `InvalidCertificate` is
expected to have additional metadata `REASON` which is a string. The two APIs
`elog` and `report` have slightly different behaviors: `elog` throws an
exception which can either result in an error DBus result or be handled
elsewhere in the application, while `report` sends the event directly to
`phosphor-logging`'s daemon for recording. As a side-effect of both calls, the
metadata is inserted into the `systemd` journal.

When an error is sent to the `phosphor-logging` daemon, it will:

1. Search back through the journal for recorded metadata associated with the
   event (this is a relatively slow operation).
2. Create an [`xyz.openbmc_project.Logging.Entry`][Logging-Entry] DBus object
   with the associated data extracted from the journal.
3. Persist a serialized version of the object.

Within `bmcweb` there is support for translating
`xyz.openbmc_project.Logging.Entry` objects advertised by `phosphor-logging`
into Redfish `LogEntries`, but this support does not reference a Message
Registry. This makes the events of limited utility for consumption by system
management software, as it cannot know all of the event types and is left to
apply (hand-coded) regular expressions to extract any information from the
`Message` field of the `LogEntry`. Furthermore, these regular expressions are
likely to become outdated over time as internal OpenBMC error reporting
structure, metadata, or message strings evolve.

[Logging-Entry]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/9012243e543abdc5851b7e878c17c991b2a2a8b7/yaml/xyz/openbmc_project/Logging/Entry.interface.yaml#L1

### Issues with the Status Quo

- There are two different implementations of error logging, neither of which is
  both complete and fully accepted by maintainers. These implementations also do
  not cover tracing events.

- The `REDFISH_MESSAGE_ID` log approach leads to differences between the Redfish
  Message Registry and the reporting application. It also requires every
  application to be "Redfish aware" which limits decoupling between applications
  and external management interfaces. This also leaves gaps for reporting errors
  in different management interfaces, such as inband IPMI and PLDM. The approach
  also does not provide compile-time assurance of appropriate metadata
  collection, which can lead to producing code being out-of-date with the
  message registry definitions.

- The `phosphor-logging` approach does not provide compile-time assurance of
  appropriate metadata collection and requires expensive daemon processing of
  the `systemd` journal on each error report, which limits scalability.

- The `sdbusplus` bindings for error reporting do not currently handle lossless
  transmission of errors between DBus servers and clients.

- Similar applications can produce different Redfish `LogEntry` records for the
  same error scenario. This has been observed in sensor threshold exceeded
  events between `dbus-sensors`, `phosphor-hwmon`, `phosphor-virtual-sensor`,
  and `phosphor-health-monitor`. One cause of this is two different error
  reporting approaches and disagreements amongst maintainers as to the preferred
  approach.

## Requirements

- Applications running on the BMC must be able to report errors and failures
  which are persisted and available for external system management through
  standards such as Redfish.
  - These errors must be structured, versioned, and the complete set of errors
    able to be created by the BMC should be available at build-time of a BMC
    image.
  - The set of errors, able to be created by the BMC, must be able to be
    transformed into relevant data sets, such as Redfish Message Registries.
    - For Redfish, the transformation must comply with the Redfish standard
      requirements, such as conforming to semantic versioning expectations.
    - For Redfish, the transformation should allow mapping internally defined
      events to pre-existing Redfish Message Registries for broader
      compatibility.
    - For Redfish, the implementation must also support the EventService
      mechanics for push-reporting.
  - Errors reported by the BMC should contain sufficient information to allow
    service of the system for these failures, either by humans or automation
    (depending on the individual system requirements).

- Applications running on the BMC should be able to report important tracing
  events relevant to system management and/or debug, such as the system
  successfully reaching a running state.
  - All requirements relevant to errors are also applicable to tracing events.
  - The implementation must have a mechanism for vendors to be able to disable
    specific tracing events to conform to their own system design requirements.

- Applications running on the BMC should be able to determine when a previously
  reported error is no longer relevant and mark it as "resolved", while
  maintaining the persistent record for future usages such as debug.

- The BMC should provide a mechanism for managed entities within the server to
  report their own errors and events. Examples of managed entities would be
  firmware, such as the BIOS, and satellite management controllers.

- The implementation on the BMC should scale to a minimum of
  [10,000][error-discussion] errors and events without impacting the BMC or
  managed system performance.

- The implementation should provide a mechanism to allow OEM or vendor
  extensions to the error and event definitions (and generated artifacts such as
  the Redfish Message Registry) for usage in closed-source or non-upstreamed
  code. These extensions must be clearly identified, in all interfaces, as
  vendor-specific and not be tied to the OpenBMC project.

- APIs to implement error and event reporting should have good ergonomics. These
  APIs must provide compile-time identification, for applicable programming
  languages, of call sites which do not conform to the BMC error and event
  specifications.
  - The generated error classes and APIs should not require exceptions but
    should also integrate with the `sdbusplus` client and server bindings, which
    do leverage exceptions.

[error-discussion]:
  https://discord.com/channels/775381525260664832/855566794994221117/867794201897992213

## Proposed Design

The proposed design has a few high-level design elements:

- Consolidate the `sdbusplus` and `phosphor-logging` implementation of error
  reporting; expand it to cover tracing events; improve the ergonomics of the
  associated APIs and add compile-time checking of missing metadata.

- Add APIs to `phosphor-logging` to enable daemons to easily look up their own
  previously reported events (for marking as resolved).

- Add to `phosphor-logging` a compile-time mechanism to disable recording of
  specific tracing events for vendor-level customization.

- Generate a Redfish Message Registry for all errors and events defined in
  `phosphor-dbus-interfaces`, using binding generators from `sdbusplus`. Enhance
  the `bmcweb` implementation of the `Logging.Entry` to `LogEntry`
  transformation to cover the Redfish Message Registry and `phosphor-logging`
  enhancements. Leverage the Redfish `LogEntry.DiagnosticData` field to provide
  a Base64-encoded JSON representation of the entire `Logging.Entry` for
  additional diagnostics [[does this need to be optional?]]. Add support to the
  `bmcweb` EventService implementation for `phosphor-logging`-hosted events.

### `sdbusplus`

The `Foo.errors.yaml` content will be combined with the content formerly in the
`Foo.metadata.yaml` files specified by `phosphor-logging` and placed in a new
file type, `Foo.events.yaml`. This `Foo.events.yaml` format will cover both the
current `error` and `metadata` information and add the information necessary to
generate external facing datasets, such as Redfish Message Registries. The
current `Foo.errors.yaml` and `Foo.metadata.yaml` files will be deprecated as
their usage is replaced by the new format.

The `sdbusplus` library will be enhanced to provide the following:

- JSON serialization and de-serialization of generated exception types with
  their assigned metadata; assignment of the JSON serialization to the `message`
  field of `sd_bus_error_set` calls when errors are returned from DBus server
  calls.

- A facility to register exception types, at library load time, with the
  `sdbusplus` library for automatic conversion back to C++ exception types in
  DBus clients.

The binding generator(s) will be expanded to do the following:

- Generate complete C++ exception types, with compile-time checking of missing
  metadata and JSON serialization, for errors and events. Metadata can be of one
  of the following types:
  - size-type and signed integer
  - floating-point number
  - string
  - DBus object path

- Generate a format that `bmcweb` can use to create and populate a Redfish
  Message Registry, and translate from `phosphor-logging` to Redfish `LogEntry`
  for a set of errors and events.

For general users of `sdbusplus` these changes should have no impact, except for
the availability of new generated exception types and that specialized instances
of `sdbusplus::exception::generated_exception` will become available in DBus
clients.

### `phosphor-dbus-interfaces`

Refactoring will be done to migrate existing `Foo.metadata.yaml` and
`Foo.errors.yaml` content to the `Foo.events.yaml` as migration is done by
applications. Minor changes will take place to utilize the new binding
generators from `sdbusplus`. A small library enhancement will be done to
register all generated exception types with `sdbusplus`. Future contributors
will be able to contribute new error and tracing event definitions.

### `phosphor-logging`

> TODO: Should a tracing event be a `Logging.Entry` with severity of
> `Informational` or should it be a new type, such as `Logging.Event`, managed
> separately? The `phosphor-logging` default `meson.options` have
> `error_cap=200` and `error_info_cap=10`. If we increase the total number of
> events allowed to 10K, the majority of them are likely going to be information
> / tracing events.

The `Logging.Entry` interface's `AdditionalData` property should change to
`dict[string, variant[string,int64_t,size_t,object_path]]`.

The `Logging.Create` interface will have a new method added:

```yaml
- name: CreateEntry
  parameters:
    - name: Message
      type: string
    - name: Severity
      type: enum[Logging.Entry.Level]
    - name: AdditionalData
      type: dict[string, variant[string,int64_t,size_t,object_path]]
    - name: Hint
      type: string
      default: ""
  returns:
    - name: Entry
      type: object_path
```

The `Hint` parameter allows daemons to query for their previously recorded
error in order to mark it as resolved. These strings need to be globally unique
and are suggested to be of the format `"<service_name>:<key>"`.

A `Logging.SearchHint` interface will be created, which will be recorded at the
same object path as a `Logging.Entry` when the `Hint` parameter was not an empty
string:

```yaml
- property: Hint
  type: string
```

The `Logging.Manager` interface will be added with a single method:

```yaml
- name: FindEntry
  parameters:
    - name: Hint
      type: string
  returns:
    - name: Entry
      type: object_path
  errors:
    - xyz.openbmc_project.Common.ResourceNotFound
```
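
The interplay between `CreateEntry`, the stored `Hint`, and `FindEntry` can be
sketched with an in-memory stand-in. This is only an illustration under stated
assumptions: the class `HintIndex`, its method names, the sequential entry
paths, and the choice to let a repeated hint point at the newest entry are all
hypothetical and are not part of `phosphor-logging`. An empty `std::optional`
plays the role of the `ResourceNotFound` error.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical in-memory model of the proposed hint bookkeeping.
class HintIndex
{
  public:
    // Build a hint in the suggested "<service_name>:<key>" format.
    static std::string makeHint(const std::string& service,
                                const std::string& key)
    {
        return service + ":" + key;
    }

    // Models CreateEntry: allocate an entry path and, when `hint` is not
    // empty, remember it so the entry can be found again later.
    std::string createEntry(const std::string& hint = "")
    {
        std::string path = "/xyz/openbmc_project/logging/entry/" +
                           std::to_string(nextId++);
        if (!hint.empty())
        {
            // Assumption: a repeated hint refers to the newest entry.
            byHint[hint] = path;
        }
        return path;
    }

    // Models FindEntry: return the entry path recorded for `hint`, or an
    // empty optional where the DBus method would raise ResourceNotFound.
    std::optional<std::string> findEntry(const std::string& hint) const
    {
        const auto it = byHint.find(hint);
        if (it == byHint.end())
        {
            return std::nullopt;
        }
        return it->second;
    }

  private:
    std::uint64_t nextId = 1;
    std::map<std::string, std::string> byHint;
};
```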

A `lg2::commit` API will be added to support the new `sdbusplus` generated
exception types, calling the new `Logging.Create.CreateEntry` method proposed
earlier. This new API will support `sdbusplus::bus_t` for synchronous DBus
operations and both `sdbusplus::async::context_t` and
`sdbusplus::asio::connection` for asynchronous DBus operations.

There are outstanding performance concerns with the `phosphor-logging`
implementation that may impact the ability to scale to 10,000 event records.
This issue is expected to be self-contained within `phosphor-logging`, except
for potential future changes to the log-retrieval interfaces used by `bmcweb`.
In order to decouple the transition to this design, by callers of the logging
APIs, from the experimentation and improvements in `phosphor-logging`, we will
add a compile option and Yocto `DISTRO_FEATURE` that can turn `lg2::commit`
behavior into an `OPENBMC_MESSAGE_ID` record in the journal, following the same
approach as the previous `REDFISH_MESSAGE_ID`. Corresponding `rsyslog`
configuration and `bmcweb` support will be added to use these records directly.
This will allow systems which knowingly scale to a large number of event
records, using `rsyslog` mechanics, the same level of performance. One caveat of
this support is that the hint and resolution behavior will not exist when that
option is enabled.

### `bmcweb`

`bmcweb` already has support for build-time conversion from a Redfish Message
Registry, codified in JSON, to header files it uses to serve the registry; this
will be expanded to support Redfish Message Registries generated by `sdbusplus`.
`bmcweb` will add a Meson option for additional message registries, given as a
path to a directory of Message Registry JSON files; bitbake will populate it
from `phosphor-dbus-interfaces` and vendor-specific event definitions. Support
will also be added for adding `phosphor-dbus-interfaces` as a Meson subproject
for stand-alone testing.

It is desirable for `sdbusplus` to generate a Redfish Message Registry directly,
leveraging the existing scripts for integration with `bmcweb`. As part of this
we would like to support mapping a `Logging.Entry` event to an existing
standardized Redfish event (such as those in the Base registry). The generated
information must contain the `Logging.Entry::Message` identifier, the
`AdditionalData` to `MessageArgs` mapping, and the translation from the
`Message` identifier to the Redfish Message ID (when the Message ID is not from
"this" registry). In order to facilitate this, we will need to add OEM fields to
the Redfish Message Registry JSON, which are only used by the `bmcweb`
processing scripts, to generate the information necessary for this additional
mapping.

The `xyz.openbmc_project.Logging.Entry` to `LogEntry` conversion needs to be
enhanced, to utilize these Message Registries, in four ways:

1. A Base64-encoded JSON representation of the `Logging.Entry` will be assigned
   to the `DiagnosticData` property.

2. If the `Logging.Entry::Message` contains an identifier corresponding to a
   Registry entry, the `MessageId` property will be set to the corresponding
   Redfish Message ID. Otherwise, the `Logging.Entry::Message` will be used
   directly with no further transformation (as is done today).

3. If the `Logging.Entry::Message` contains an identifier corresponding to a
   Registry entry, the `MessageArgs` property will be filled in by obtaining the
   corresponding values from the `AdditionalData` dictionary and the `Message`
   field will be generated from combining these values with the `Message` string
   from the Registry.

4. A mechanism should be implemented to translate DBus `object_path` references
   to Redfish Resource URIs. When an `object_path` cannot be translated,
   `bmcweb` will use a prefix such as `object_path:` in the `MessageArgs` value.

The implementation of `EventService` should be enhanced to support
`phosphor-logging` hosted events. The implementation of `LogService` should be
enhanced to support log paging for `phosphor-logging` hosted events.
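
Conversion steps 3 and 4 in the list above can be sketched as follows. This is
a rough illustration, not `bmcweb` code: the types `ObjectPath`, `Value`,
`AdditionalData`, and `UriMap`, and the functions `toMessageArg` and
`buildMessageArgs`, are hypothetical names. `std::uint64_t` stands in for
`size_t`, the lookup table stands in for whatever translation mechanism is
eventually implemented, and emitting an empty argument for a missing key is an
assumption made to keep argument positions aligned.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the DBus types carried in AdditionalData.
struct ObjectPath
{
    std::string path;
};
using Value =
    std::variant<std::string, std::int64_t, std::uint64_t, ObjectPath>;
using AdditionalData = std::map<std::string, Value>;

// Hypothetical table mapping DBus object paths to Redfish resource URIs.
using UriMap = std::map<std::string, std::string>;

// Render one AdditionalData value as a MessageArgs string. Object paths
// without a known URI fall back to the "object_path:" prefix.
std::string toMessageArg(const Value& value, const UriMap& uris)
{
    if (const auto* text = std::get_if<std::string>(&value))
    {
        return *text;
    }
    if (const auto* number = std::get_if<std::int64_t>(&value))
    {
        return std::to_string(*number);
    }
    if (const auto* size = std::get_if<std::uint64_t>(&value))
    {
        return std::to_string(*size);
    }
    const auto& object = std::get<ObjectPath>(value);
    const auto it = uris.find(object.path);
    return it != uris.end() ? it->second : "object_path:" + object.path;
}

// Build MessageArgs by looking up `keys` (the registry's argument order)
// in the entry's AdditionalData.
std::vector<std::string> buildMessageArgs(const std::vector<std::string>& keys,
                                          const AdditionalData& data,
                                          const UriMap& uris)
{
    std::vector<std::string> args;
    for (const auto& key : keys)
    {
        const auto it = data.find(key);
        // Assumption: a missing key yields an empty argument.
        args.push_back(it != data.end() ? toMessageArg(it->second, uris)
                                        : std::string{});
    }
    return args;
}
```

The resulting vector is what would populate `MessageArgs`; combining it with
the registry `Message` string then gives the `Message` field.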

### `phosphor-sel-logger`

The `phosphor-sel-logger` has a meson option `send-to-logger` which toggles
between using `phosphor-logging` or the [`REDFISH_MESSAGE_ID`
mechanism][existing-design]. The `phosphor-logging`-utilizing paths will be
updated to utilize `phosphor-dbus-interfaces` specified errors and events.

### YAML format

Consider an example file in `phosphor-dbus-interfaces` as
`yaml/xyz/openbmc_project/Software/Update.events.yaml` with hypothetical errors
and events:

```yaml
version: 1.3.1

errors:
  - name: UpdateFailure
    severity: critical
    metadata:
      - name: TARGET
        type: string
        primary: true
      - name: ERRNO
        type: int64
      - name: CALLOUT_HARDWARE
        type: object_path
        primary: true
    en:
      description: While updating the firmware on a device, the update failed.
      message: A failure occurred updating {TARGET} on {CALLOUT_HARDWARE}.
      resolution: Retry update.

  - name: BMCUpdateFailure
    severity: critical
    deprecated: 1.0.0
    en:
      description: Failed to update the BMC
    redfish-mapping: OpenBMC.FirmwareUpdateFailed

events:
  - name: UpdateProgress
    metadata:
      - name: TARGET
        type: string
        primary: true
      - name: COMPLETION
        type: double
        primary: true
    en:
      description: An update is in progress and has reached a checkpoint.
      message: Updating of {TARGET} is {COMPLETION}% complete.
```

Each `foo.events.yaml` file would be used to generate both the C++ classes (via
`sdbusplus`) for exception handling and event reporting, as well as a versioned
Redfish Message Registry for the errors and events. The [YAML
schema][yaml-schema] is contained in the sdbusplus repository.

The above example YAML would generate C++ classes similar to:

```cpp
namespace sdbusplus::errors::xyz::openbmc_project::software::update
{

class UpdateFailure
{
  public:
    // Metadata is passed as alternating name/value arguments.
    template <typename... Args>
    UpdateFailure(Args&&... args);
};

}

namespace sdbusplus::events::xyz::openbmc_project::software::update
{

class UpdateProgress
{
  public:
    // Metadata is passed as alternating name/value arguments.
    template <typename... Args>
    UpdateProgress(Args&&... args);
};

}
```

The constructors here are variadic templates because the generated constructor
implementation will provide compile-time assurance that all of the metadata
fields have been populated (in any order). To raise an `UpdateFailure`, a
developer might do something like:

```cpp
// Immediately report the event:
lg2::commit(UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path));
// or send it in a dbus response (when using sdbusplus generated binding):
throw UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path);
```

If one of the fields, such as `ERRNO`, were omitted, a compile failure would be
raised indicating the first missing field.

[yaml-schema]:
  https://github.com/openbmc/sdbusplus/blob/master/tools/sdbusplus/schemas/events.schema.yaml
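
To illustrate that a "first missing field" check can be evaluated entirely at
compile time, here is a small C++17 sketch. It is not how the `sdbusplus`
generator implements the check: the function `firstMissing` and the example
arrays are hypothetical, and the sketch deliberately leaves out how the
generated constructor obtains the supplied names from its arguments.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical helper: return the index of the first name in `required`
// that does not appear in `provided`, or R when nothing is missing. The
// order of `provided` does not matter.
template <std::size_t R, std::size_t P>
constexpr std::size_t
    firstMissing(const std::array<std::string_view, R>& required,
                 const std::array<std::string_view, P>& provided)
{
    for (std::size_t r = 0; r < R; ++r)
    {
        bool found = false;
        for (std::size_t p = 0; p < P; ++p)
        {
            if (required[r] == provided[p])
            {
                found = true;
            }
        }
        if (!found)
        {
            return r;
        }
    }
    return R;
}

// Metadata names of the hypothetical UpdateFailure error.
inline constexpr std::array<std::string_view, 3> updateFailureFields{
    "TARGET", "ERRNO", "CALLOUT_HARDWARE"};

// Names supplied by two imagined call sites.
inline constexpr std::array<std::string_view, 3> completeCall{
    "CALLOUT_HARDWARE", "TARGET", "ERRNO"};
inline constexpr std::array<std::string_view, 2> incompleteCall{
    "TARGET", "CALLOUT_HARDWARE"};

// Both checks are resolved by the compiler, not at run time.
static_assert(firstMissing(updateFailureFields, completeCall) == 3,
              "every field is supplied, in any order");
static_assert(firstMissing(updateFailureFields, incompleteCall) == 1,
              "ERRNO (index 1) is the first missing field");
```

A generated constructor could turn a result other than `R` into a
`static_assert` failure that names the missing field.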

### Versioning Policy

Assume the version follows the semantic versioning `MAJOR.MINOR.PATCH`
convention.

- Adjusting a description or message should result in a `PATCH` increment.
- Adding a new error or event, or adding metadata to an existing error or event,
  should result in a `MINOR` increment.
- Deprecating an error or event should result in a `MAJOR` increment.

There is [guidance on maintenance][registry-guidance] of the OpenBMC Message
Registry. We will incorporate that guidance into the equivalent
`phosphor-dbus-interfaces` policy.

[registry-guidance]:
  https://github.com/openbmc/bmcweb/blob/master/redfish-core/include/registries/openbmc_message_registry.readmefirst.md
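
The increment rules listed in this section can be written down as a small
function. This is a sketch only: the names `Version`, `Change`, `bump`, and
`toString` are hypothetical, and resetting the lower fields to zero follows the
usual semantic versioning convention rather than anything stated in this
policy.

```cpp
#include <string>

// Hypothetical representation of a MAJOR.MINOR.PATCH version.
struct Version
{
    unsigned majorPart;
    unsigned minorPart;
    unsigned patchPart;
};

// The kinds of change named by the versioning policy.
enum class Change
{
    TextOnly,    // description or message adjusted
    Addition,    // new error/event, or new metadata on an existing one
    Deprecation, // an error or event was deprecated
};

// Apply the policy: PATCH for text, MINOR for additions, MAJOR for
// deprecations. Lower fields reset to zero (semantic versioning convention).
Version bump(Version v, Change change)
{
    switch (change)
    {
        case Change::TextOnly:
            return {v.majorPart, v.minorPart, v.patchPart + 1};
        case Change::Addition:
            return {v.majorPart, v.minorPart + 1, 0};
        case Change::Deprecation:
            return {v.majorPart + 1, 0, 0};
    }
    return v; // unreachable for valid enumerators
}

// Format the version as "MAJOR.MINOR.PATCH".
std::string toString(const Version& v)
{
    return std::to_string(v.majorPart) + "." + std::to_string(v.minorPart) +
           "." + std::to_string(v.patchPart);
}
```

Starting from the `1.3.1` used in the example YAML, a message wording fix would
give `1.3.2`, a new event `1.4.0`, and a deprecation `2.0.0`.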

### Generated Redfish Message Registry

[DSP0266][dsp0266], the Redfish specification, gives requirements for Redfish
Message Registries and dictates guidelines for identifiers.

The hypothetical events defined above would create a message registry similar
to:

```json
{
  "Id": "OpenBMC_Base_Xyz_OpenbmcProject_Software_Update.1.3.1",
  "Language": "en",
  "Messages": {
    "UpdateFailure": {
      "Description": "While updating the firmware on a device, the update failed.",
      "Message": "A failure occurred updating %1 on %2.",
      "Resolution": "Retry update.",
      "NumberOfArgs": 2,
      "ParamTypes": ["string", "string"],
      "Severity": "Critical"
    },
    "UpdateProgress": {
      "Description": "An update is in progress and has reached a checkpoint.",
      "Message": "Updating of %1 is %2% complete.",
      "Resolution": "None",
      "NumberOfArgs": 2,
      "ParamTypes": ["string", "number"],
      "Severity": "OK"
    }
  }
}
```
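
One step of this generation is rewriting the `{NAME}` references from the YAML
`message` into Redfish `%N` placeholders. The sketch below assumes, based only
on the example above, that `N` is the 1-based position of the field among the
metadata entries marked `primary: true`. The function `toRegistryMessage` is a
hypothetical name and is not part of the `sdbusplus` generator; copying unknown
names through unchanged is likewise an assumption.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: replace each "{NAME}" in `yamlMessage` with "%N",
// where N is the 1-based position of NAME in `primary`. Names that are not
// listed, and unterminated braces, are copied through unchanged.
std::string toRegistryMessage(std::string_view yamlMessage,
                              const std::vector<std::string>& primary)
{
    std::string out;
    std::size_t pos = 0;
    while (pos < yamlMessage.size())
    {
        const std::size_t open = yamlMessage.find('{', pos);
        const std::size_t close = (open == std::string_view::npos)
                                      ? std::string_view::npos
                                      : yamlMessage.find('}', open);
        if (close == std::string_view::npos)
        {
            out += yamlMessage.substr(pos); // no further complete reference
            break;
        }

        out += yamlMessage.substr(pos, open - pos); // text before the brace
        const std::string_view name =
            yamlMessage.substr(open + 1, close - open - 1);

        // Find the 1-based position of the name among the primary fields.
        std::size_t index = 0;
        for (std::size_t i = 0; i < primary.size(); ++i)
        {
            if (primary[i] == name)
            {
                index = i + 1;
                break;
            }
        }

        if (index != 0)
        {
            out += "%" + std::to_string(index);
        }
        else
        {
            out += yamlMessage.substr(open, close - open + 1);
        }
        pos = close + 1;
    }
    return out;
}
```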

The prefix `OpenBMC_Base` shall be exclusively reserved for use by events from
`phosphor-logging`. Events defined in other repositories will be expected to use
some other prefix. Vendor-defined repositories should use a vendor-owned prefix
as directed by [DSP0266][dsp0266].

[dsp0266]:
  https://www.dmtf.org/sites/default/files/standards/documents/DSP0266_1.20.0.pdf

### Vendor implications

As specified above, vendors must use their own identifiers in order to conform
with the Redfish specification (see [DSP0266][dsp0266] for requirements on
identifier naming). The `sdbusplus` (and `phosphor-logging` and `bmcweb`)
implementation(s) will enable vendors to create their own events for downstream
code and Registries for integration with Redfish, by creating downstream
repositories of error definitions. Vendors are responsible for ensuring their
own versioning and identifiers conform to the expectations in the [Redfish
specification][dsp0266].

One potential bad behavior on the part of vendors would be forking and modifying
`phosphor-dbus-interfaces` defined events. Vendors must not add their own events
to `phosphor-dbus-interfaces` in downstream implementations, because their
implementation would then advertise a message as part of an OpenBMC-owned
Registry when it is not. Instead, they should add them to their own
repositories with a separate identifier. Similarly, if a vendor were to
_backport_ upstream changes into their fork, they would need to ensure that the
`foo.events.yaml` file for that version matches identically with the upstream
implementation.

## Alternatives Considered

Many alternatives have been explored and referenced through earlier work. Within
this proposal there are many minor alternatives that have been assessed.

### Exception inheritance

The original `phosphor-logging` error descriptions allowed inheritance between
two errors. This is not supported by the proposal for two reasons:

- This introduces complexity in the Redfish Message Registry versioning because
  a change in one file should induce version changes in all dependent files.

- It makes it difficult for a developer to clearly identify all of the fields
  they are expected to populate without traversing multiple files.

### sdbusplus Exception APIs

There are a few possible syntaxes I came up with for constructing the generated
exception types. It is important that these have good ergonomics, are easy to
understand, and can provide compile-time awareness of missing metadata fields.

```cpp
    using Example = sdbusplus::error::xyz::openbmc_project::Example;

    // 1)
    throw Example().fru("Motherboard").value(42);

    // 2)
    throw Example(Example::fru_{}, "Motherboard", Example::value_{}, 42);

    // 3)
    throw Example("FRU", "Motherboard", "VALUE", 42);

    // 4)
    throw Example([](auto e) { return e.fru("Motherboard").value(42); });

    // 5)
    throw Example({.fru = "Motherboard", .value = 42});
```

**Note**: These examples are all shown using `throw` syntax, but could also be
saved in local variables, returned from functions, or immediately passed to
`lg2::commit`.

1. This would be my preference for ergonomics and clarity, as it would allow
   LSP-enabled editors to give completions for the metadata fields, but
   unfortunately there is no mechanism in C++ to define a type which can be
   constructed but not thrown, which means we cannot get compile-time checking
   of all metadata fields.

2. This syntax uses tag-dispatch to enable compile-time checking of all
   metadata fields and potential LSP-completion of the tag-types, but is more
   verbose than option 3.

3. This syntax is less verbose than (2) and follows conventions already used in
   `phosphor-logging`'s `lg2` API, but does not allow LSP-completion of the
   metadata tags.

4. This syntax is similar to option (1) but uses an indirection of a lambda to
   enable compile-time checking that all metadata fields have been populated by
   the lambda. The LSP-completion is likely not as strong as option (1), due to
   the use of `auto`, and the lambda necessity will likely be a hang-up for
   unfamiliar developers.

5. This syntax has similar characteristics to option (1) but similarly does not
   provide compile-time confirmation that all fields have been populated.

The proposal therefore suggests option (3) is most suitable.

### Redfish Translation Support

The proposed YAML format allows future addition of translation but it is not
enabled at this time. Future development could enable the Redfish Message
Registry to be generated in multiple languages if the `message:language` exists
for those languages.

### Redfish Registry Versioning

The Redfish Message Registries are required to be versioned and have three
numeric fields (i.e. `XX.YY.ZZ`), but only the first two are supposed to be used
in the Message ID. Rather than using the manually specified version we could
take a few other approaches:

- Use a date code (e.g. `2024.17.x`) representing the ISO 8601 week when the
  registry was built.
  - This does not cover vendors that may choose to branch for stabilization
    purposes, so we can end up with two machines having the same
    OpenBMC-versioned message registry with different content.

- Use the most recent `openbmc/openbmc` tag as the version.
  - This does not cover vendors that build off HEAD and may deploy multiple
    images between two OpenBMC releases.

- Generate the version based on the git-history.
  - This requires `phosphor-dbus-interfaces` to be built from a git repository,
    which may not always be true for Yocto source mirrors, and requires
    non-trivial processing that continues to scale over time.

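To make the date-code alternative and the two-field Message ID rule concrete,
here is a minimal sketch. `isoWeekVersion` and `messageId` are hypothetical
helpers, not functions from any OpenBMC repository, and the sketch assumes the
`%G` and `%V` conversions of `strftime` are available, as they are on
C99-conforming platforms.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: builds "<ISO year>.<ISO week>" from a calendar date.
std::string isoWeekVersion(int year, int month, int day)
{
    std::tm date{};
    date.tm_year = year - 1900;
    date.tm_mon = month - 1;
    date.tm_mday = day;
    date.tm_hour = 12;  // midday keeps the date stable across time zones
    date.tm_isdst = -1; // let mktime decide whether DST applies
    std::mktime(&date); // fills in tm_wday and tm_yday

    char isoYear[8]{};
    char isoWeek[4]{};
    std::strftime(isoYear, sizeof(isoYear), "%G", &date);
    std::strftime(isoWeek, sizeof(isoWeek), "%V", &date);

    // Convert through int so that week "07" becomes "7".
    return std::to_string(std::stoi(isoYear)) + "." +
           std::to_string(std::stoi(isoWeek));
}

// Hypothetical helper: builds a Message ID from the registry name, a version
// such as "XX.YY.ZZ", and the message key, keeping only the first two fields.
std::string messageId(const std::string& registry, const std::string& version,
                      const std::string& key)
{
    // Keep everything before the second '.', or the whole string if the
    // version has fewer than three fields.
    auto firstDot = version.find('.');
    auto secondDot = version.find('.', firstDot + 1);
    return registry + "." + version.substr(0, secondDot) + "." + key;
}
```

For example, `isoWeekVersion(2024, 4, 24)` yields `2024.17`, and
`messageId("OpenBMC", "0.4.0", "PowerButtonPressed")` yields
`OpenBMC.0.4.PowerButtonPressed`.
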
### Existing OpenBMC Redfish Registry

There are currently 191 messages defined in the existing Redfish Message
Registry at version `OpenBMC.0.4.0`. Of those, not a single one in the codebase
is emitted with the correct version. 96 of those are only emitted by
Intel-specific code that is not pulled into any upstreamed machine, 39 are
emitted by potentially common code, and 56 are not even referenced in the
codebase outside of the bmcweb registry. Of the 39 common messages, half have an
equivalent in one of the standard registries that should be leveraged, and many
of the others do not have attributes that would facilitate a multi-host
configuration, so the registry at a minimum needs to be updated. No part of the
current implementation is capable of handling Redfish Resource URIs.

The proposal therefore is to deprecate the existing registry and replace it with
the new generated registries. For repositories that currently emit events in the
existing format, we can maintain those call-sites for a time period of 1-2
years.

If this aspect of the proposal is rejected, the YAML format allows mapping from
`phosphor-dbus-interfaces` defined events to the current `OpenBMC.0.4.0`
registry `MessageIds`.

Potentially common:

- phosphor-post-code-manager
  - BIOSPOSTCode (unique)
- dbus-sensors
  - ChassisIntrusionDetected (unique)
  - ChassisIntrusionReset (unique)
  - FanInserted
  - FanRedundancyLost (unique)
  - FanRedudancyRegained (unique)
  - FanRemoved
  - LanLost
  - LanRegained
  - PowerSupplyConfigurationError (unique)
  - PowerSupplyConfigurationErrorRecovered (unique)
  - PowerSupplyFailed
  - PowerSupplyFailurePredicted (unique)
  - PowerSupplyFanFailed
  - PowerSupplyFanRecovered
  - PowerSupplyPowerLost
  - PowerSupplyPowerRestored
  - PowerSupplyPredictiedFailureRecovered (unique)
  - PowerSupplyRecovered
- phosphor-sel-logger
  - IPMIWatchdog (unique)
  - `SensorThreshold*` : 8 different events
- phosphor-net-ipmid
  - InvalidLoginAttempted (unique)
- entity-manager
  - InventoryAdded (unique)
  - InventoryRemoved (unique)
- estoraged
  - ServiceStarted
- x86-power-control
  - NMIButtonPressed (unique)
  - NMIDiagnosticInterrupt (unique)
  - PowerButtonPressed (unique)
  - PowerRestorePolicyApplied (unique)
  - PowerSupplyPowerGoodFailed (unique)
  - ResetButtonPressed (unique)
  - SystemPowerGoodFailed (unique)

Intel-only implementations:

- intel-ipmi-oem
  - ADDDCCorrectable
  - BIOSPostERROR
  - BIOSRecoveryComplete
  - BIOSRecoveryStart
  - FirmwareUpdateCompleted
  - IntelUPILinkWidthReducedToHalf
  - IntelUPILinkWidthReducedToQuarter
  - LegacyPCIPERR
  - LegacyPCISERR
  - `ME*` : 29 different events
  - `Memory*` : 9 different events
  - MirroringRedundancyDegraded
  - MirroringRedundancyFull
  - `PCIeCorrectable*`, `PCIeFatal` : 29 different events
  - SELEntryAdded
  - SparingRedundancyDegraded
- pfr-manager
  - BIOSFirmwareRecoveryReason
  - BIOSFirmwarePanicReason
  - BMCFirmwarePanicReason
  - BMCFirmwareRecoveryReason
  - BMCFirmwareResiliencyError
  - CPLDFirmwarePanicReason
  - CPLDFirmwareResilencyError
  - FirmwareResiliencyError
- host-error-monitor
  - CPUError
  - CPUMismatch
  - CPUThermalTrip
  - ComponentOverTemperature
  - SsbThermalTrip
  - VoltageRegulatorOverheated
- s2600wf-misc
  - DriveError
  - InventoryAdded

## Impacts

- New APIs are defined for error and event logging. This will deprecate existing
  `phosphor-logging` APIs, with a time to migrate, for error reporting.

- The design should improve performance by eliminating the regular parsing of
  the `systemd` journal. The design may decrease performance by allowing the
  number of error and event logs to be dramatically increased, which has an
  impact on file system utilization and the potential for DBus impacts on some
  services such as `ObjectMapper`.

- Backwards compatibility and documentation should be improved by the automatic
  generation of the Redfish Message Registry corresponding to all error and
  event reports.

### Organizational

- **Does this proposal require a new repository?**
  - No
- **Who will be the initial maintainer(s) of this repository?**
  - N/A
- **Which repositories are expected to be modified to execute this design?**
  - `sdbusplus`
  - `phosphor-dbus-interfaces`
  - `phosphor-logging`
  - `bmcweb`
  - Any repository creating an error or event.

## Testing

- Unit tests will be written in `sdbusplus` and `phosphor-logging` for the error
  and event generation, creation APIs, and to provide coverage on any changes to
  the `Logging.Entry` object management.

- Unit tests will be written for `bmcweb` for basic `Logging.Entry`
  transformation and Message Registry generation.

- Integration tests should be leveraged (and enhanced as necessary) from
  `openbmc-test-automation` to cover the end-to-end error creation and Redfish
  reporting.