# Error and Event Logging

Author: [Patrick Williams][patrick-email] `<stwcx>`

[patrick-email]: mailto:patrick@stwcx.xyz

Other contributors:

Created: May 16, 2024

## Problem Description

There is currently not a consistent end-to-end error and event reporting design
for the OpenBMC code stack. There are two different implementations, one
primarily using phosphor-logging and one using rsyslog, both of which have gaps
that a complete solution should address. This proposal is intended to be an
end-to-end design handling both errors and tracing events which facilitate
external management of the system in an automated and maintainable manner.

## Background and References

### Redfish LogEntry and Message Registry

In Redfish, the [`LogEntry` schema][LogEntry] is used for a range of items that
could be considered "logs", but one such use within OpenBMC is for an equivalent
of the IPMI "System Event Log (SEL)".

The IPMI SEL is the location where the BMC can collect errors and events,
sometimes coming from other entities, such as the BIOS. Examples of these might
be "DIMM-A0 encountered an uncorrectable ECC error" or "System boot successful".
These SEL records are exposed as human readable strings, either natively by an
OEM SEL design or by tools such as `ipmitool`, which are typically unique to
each system or manufacturer, and could hypothetically change with a BMC or
firmware update, and are thus difficult to create automated tooling around. Two
different vendors might use different strings to represent a critical
temperature threshold exceeded: ["temperature threshold exceeded"][HPE-Example]
and ["Temperature #0x30 Upper Critical going high"][Oracle-Example]. There is
also no mechanism with IPMI to ask the machine "what are all of the SELs you
might create".

In order to solve two aspects of this problem, listing of possible events and
versioning, Redfish has Message Registries. A message registry is a versioned
collection of all of the error events that a system could generate and hints as
to how they might be parsed and displayed to a user. An [informative
reference][Registry-Example] from the DMTF gives this example:

```json
{
  "@odata.type": "#MessageRegistry.v1_0_0.MessageRegistry",
  "Id": "Alert.1.0.0",
  "RegistryPrefix": "Alert",
  "RegistryVersion": "1.0.0",
  "Messages": {
    "LanDisconnect": {
      "Description": "A LAN Disconnect on %1 was detected on system %2.",
      "Message": "A LAN Disconnect on %1 was detected on system %2.",
      "Severity": "Warning",
      "NumberOfArgs": 2,
      "Resolution": "None"
    }
  }
}
```

This example defines an event, `Alert.1.0.LanDisconnect`, which can record the
disconnect state of a network device and contains placeholders for the affected
device and system. When this event occurs, there might be a `LogEntry` recorded
containing something like:

```json
{
  "Message": "A LAN Disconnect on EthernetInterface 1 was detected on system /redfish/v1/Systems/1.",
  "MessageId": "Alert.1.0.LanDisconnect",
  "MessageArgs": ["EthernetInterface 1", "/redfish/v1/Systems/1"]
}
```

The `Message` contains a human readable string which was created by applying the
`MessageArgs` to the placeholders from the `Message` field in the registry.
System management software can rely on the message registry (referenced from the
`MessageId` field in the `LogEntry`) and `MessageArgs` to avoid needing to
perform string processing for reacting to the event.
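The substitution described above can be sketched in a few lines. This is a minimal illustration, not code from `bmcweb`: the function name `formatMessage` is hypothetical, and it assumes placeholders are written as `%` followed by a 1-based decimal index, as in the registry example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Replace each "%N" placeholder (1-based) in a registry Message template with
// the N-th entry of MessageArgs. Placeholders without a matching argument, and
// any '%' not followed by a digit, are copied through unchanged.
// (formatMessage is a hypothetical name used only for this sketch.)
std::string formatMessage(const std::string& tmpl,
                          const std::vector<std::string>& args)
{
    std::string out;
    for (std::size_t i = 0; i < tmpl.size();)
    {
        if (tmpl[i] == '%' && i + 1 < tmpl.size() &&
            std::isdigit(static_cast<unsigned char>(tmpl[i + 1])))
        {
            // Parse the decimal index that follows the '%'.
            std::size_t j = i + 1;
            std::size_t index = 0;
            while (j < tmpl.size() &&
                   std::isdigit(static_cast<unsigned char>(tmpl[j])))
            {
                index = index * 10 + static_cast<std::size_t>(tmpl[j] - '0');
                ++j;
            }
            if (index >= 1 && index <= args.size())
            {
                out += args[index - 1];
            }
            else
            {
                out.append(tmpl, i, j - i); // keep the unknown placeholder
            }
            i = j;
        }
        else
        {
            out += tmpl[i++];
        }
    }
    return out;
}
```

A management client never needs to run this in reverse: it can read `MessageId` and `MessageArgs` directly and treat `Message` as display text only.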

Within OpenBMC, there is currently a [limited design][existing-design] for this
Redfish feature and it requires inserting specially formed Redfish-specific
logging messages into any application that wants to record these events, tightly
coupling all applications to the Redfish implementation. It has also been
observed that these [strings][app-example], when used, are often out of date
with the [message registry][openbmc-registry] advertised by `bmcweb`. Some
maintainers have rejected adding new Redfish-specific logging messages to their
applications.

[LogEntry]:
  https://github.com/openbmc/bmcweb/blob/de0c960c4262169ea92a4b852dd5ebbe3810bf00/redfish-core/schema/dmtf/json-schema/LogEntry.v1_16_0.json
[HPE-Example]:
  https://support.hpe.com/hpesc/public/docDisplay?docId=sd00002092en_us&docLocale=en_US&page=GUID-D7147C7F-2016-0901-06CE-000000000422.html
[Oracle-Example]:
  https://docs.oracle.com/cd/E19464-01/820-6850-11/IPMItool.html#50602039_63068
[Registry-Example]:
  https://www.dmtf.org/sites/default/files/Redfish%20School%20-%20Events_0.pdf
[existing-design]:
  https://github.com/openbmc/docs/blob/master/architecture/redfish-logging-in-bmcweb.md
[app-example]:
  https://github.com/openbmc/phosphor-post-code-manager/blob/f2da78deb3a105c7270f74d9d747c77f0feaae2c/src/post_code.cpp#L143
[openbmc-registry]:
  https://github.com/openbmc/bmcweb/blob/4ba5be51e3fcbeed49a6a312b4e6b2f1ea7447ba/redfish-core/include/registries/openbmc.json#L5

### Existing phosphor-logging implementation

**Note**: While the word 'exception' is used in this section, the existing (and
proposed) types can be used by applications and execution contexts with
exceptions disabled. They are 'exceptions' because they do inherit from
`std::exception` and there is support in the `sdbusplus` bindings for them to be
used in exception handling.

The `sdbusplus` bindings have the capability to define new C++ exception types
which can be thrown by a DBus server and turned into an error response to the
client. `phosphor-logging` extended this to also add metadata associated to the
log type. See the following example error definitions and usages.

`sdbusplus` error binding definition (in
`xyz/openbmc_project/Certs.errors.yaml`):

```yaml
- name: InvalidCertificate
  description: Invalid certificate file.
```

`phosphor-logging` metadata definition (in
`xyz/openbmc_project/Certs.metadata.yaml`):

```yaml
- name: InvalidCertificate
  meta:
    - str: "REASON=%s"
      type: string
```

Application code reporting an error:

```cpp
elog<InvalidCertificate>(Reason("Invalid certificate file format"));
// or
report<InvalidCertificate>(Reason("Existing certificate file is corrupted"));
```

In this sample, an error named
`xyz.openbmc_project.Certs.Error.InvalidCertificate` has been defined, which can
be sent between applications as a DBus response. The `InvalidCertificate` is
expected to have additional metadata `REASON` which is a string. The two APIs
`elog` and `report` have slightly different behaviors: `elog` throws an
exception which can either result in an error DBus result or be handled
elsewhere in the application, while `report` sends the event directly to
`phosphor-logging`'s daemon for recording. As a side-effect of both calls, the
metadata is inserted into the `systemd` journal.

When an error is sent to the `phosphor-logging` daemon, it will:

1. Search back through the journal for recorded metadata associated with the
   event (this is a relatively slow operation).
2. Create an [`xyz.openbmc_project.Logging.Entry`][Logging-Entry] DBus object
   with the associated data extracted from the journal.
3. Persist a serialized version of the object.

Within `bmcweb` there is support for translating
`xyz.openbmc_project.Logging.Entry` objects advertised by `phosphor-logging`
into Redfish `LogEntries`, but this support does not reference a Message
Registry. This makes the events of limited utility for consumption by system
management software, as it cannot know all of the event types and is left to
perform (hand-coded) regular-expressions to extract any information from the
`Message` field of the `LogEntry`. Furthermore, these regular-expressions are
likely to become outdated over time as internal OpenBMC error reporting
structure, metadata, or message strings evolve.

[Logging-Entry]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/9012243e543abdc5851b7e878c17c991b2a2a8b7/yaml/xyz/openbmc_project/Logging/Entry.interface.yaml#L1

### Issues with the Status Quo

- There are two different implementations of error logging, neither of which is
  both complete and fully accepted by maintainers. These implementations also do
  not cover tracing events.

- The `REDFISH_MESSAGE_ID` log approach leads to differences between the Redfish
  Message Registry and the reporting application. It also requires every
  application to be "Redfish aware" which limits decoupling between applications
  and external management interfaces. This also leaves gaps for reporting errors
  in different management interfaces, such as inband IPMI and PLDM. The approach
  also does not provide compile-time assurance of appropriate metadata
  collection, which can lead to producing code being out-of-date with the
  message registry definitions.

- The `phosphor-logging` approach does not provide compile-time assurance of
  appropriate metadata collection and requires expensive daemon processing of
  the `systemd` journal on each error report, which limits scalability.

- The `sdbusplus` bindings for error reporting do not currently handle lossless
  transmission of errors between DBus servers and clients.

- Similar applications can result in different Redfish `LogEntry` for the same
  error scenario. This has been observed in sensor threshold exceeded events
  between `dbus-sensors`, `phosphor-hwmon`, `phosphor-virtual-sensor`, and
  `phosphor-health-monitor`. One cause of this is two different error reporting
  approaches and disagreements amongst maintainers as to the preferred approach.

## Requirements

- Applications running on the BMC must be able to report errors and failures
  which are persisted and available for external system management through
  standards such as Redfish.

  - These errors must be structured, versioned, and the complete set of errors
    able to be created by the BMC should be available at build-time of a BMC
    image.
  - The set of errors, able to be created by the BMC, must be able to be
    transformed into relevant data sets, such as Redfish Message Registries.
    - For Redfish, the transformation must comply with the Redfish standard
      requirements, such as conforming to semantic versioning expectations.
    - For Redfish, the transformation should allow mapping internally defined
      events to pre-existing Redfish Message Registries for broader
      compatibility.
    - For Redfish, the implementation must also support the EventService
      mechanics for push-reporting.
  - Errors reported by the BMC should contain sufficient information to allow
    service of the system for these failures, either by humans or automation
    (depending on the individual system requirements).

- Applications running on the BMC should be able to report important tracing
  events relevant to system management and/or debug, such as the system
  successfully reaching a running state.

  - All requirements relevant to errors are also applicable to tracing events.
  - The implementation must have a mechanism for vendors to be able to disable
    specific tracing events to conform to their own system design requirements.

- Applications running on the BMC should be able to determine when a previously
  reported error is no longer relevant and mark it as "resolved", while
  maintaining the persistent record for future usages such as debug.

- The BMC should provide a mechanism for managed entities within the server to
  report their own errors and events. Examples of managed entities would be
  firmware, such as the BIOS, and satellite management controllers.

- The implementation on the BMC should scale to a minimum of
  [10,000][error-discussion] errors and events without impacting the BMC or
  managed system performance.

- The implementation should provide a mechanism to allow OEM or vendor
  extensions to the error and event definitions (and generated artifacts such as
  the Redfish Message Registry) for usage in closed-source or non-upstreamed
  code. These extensions must be clearly identified, in all interfaces, as
  vendor-specific and not be tied to the OpenBMC project.

- APIs to implement error and event reporting should have good ergonomics. These
  APIs must provide compile-time identification, for applicable programming
  languages, of call sites which do not conform to the BMC error and event
  specifications.

  - The generated error classes and APIs should not require exceptions but
    should also integrate with the `sdbusplus` client and server bindings, which
    do leverage exceptions.

[error-discussion]:
  https://discord.com/channels/775381525260664832/855566794994221117/867794201897992213

## Proposed Design

The proposed design has a few high-level design elements:

- Consolidate the `sdbusplus` and `phosphor-logging` implementation of error
  reporting; expand it to cover tracing events; improve the ergonomics of the
  associated APIs and add compile-time checking of missing metadata.

- Add APIs to `phosphor-logging` to enable daemons to easily look up their own
  previously reported events (for marking as resolved).

- Add to `phosphor-logging` a compile-time mechanism to disable recording of
  specific tracing events for vendor-level customization.

- Generate a Redfish Message Registry for all errors and events defined in
  `phosphor-dbus-interfaces`, using binding generators from `sdbusplus`. Enhance
  the `bmcweb` implementation of the `Logging.Entry` to `LogEntry`
  transformation to cover the Redfish Message Registry and `phosphor-logging`
  enhancements. Leverage the Redfish `LogEntry.DiagnosticData` field to provide
  a Base64-encoded JSON representation of the entire `Logging.Entry` for
  additional diagnostics [[does this need to be optional?]]. Add support to the
  `bmcweb` EventService implementation to support `phosphor-logging`-hosted
  events.

### `sdbusplus`

The `Foo.errors.yaml` content will be combined with the content formerly in the
`Foo.metadata.yaml` files specified by `phosphor-logging` and specified by a new
file type `Foo.events.yaml`. This `Foo.events.yaml` format will cover both the
current `error` and `metadata` information as well as augment with additional
information necessary to generate external facing datasets, such as Redfish
Message Registries. The current `Foo.errors.yaml` and `Foo.metadata.yaml` files
will be deprecated as their usage is replaced by the new format.

The `sdbusplus` library will be enhanced to provide the following:

- JSON serialization and de-serialization of generated exception types with
  their assigned metadata; assignment of the JSON serialization to the `message`
  field of `sd_bus_error_set` calls when errors are returned from DBus server
  calls.

- A facility to register exception types, at library load time, with the
  `sdbusplus` library for automatic conversion back to C++ exception types in
  DBus clients.

The binding generator(s) will be expanded to do the following:

- Generate complete C++ exception types, with compile-time checking of missing
  metadata and JSON serialization, for errors and events. Metadata can be of one
  of the following types:

  - size-type and signed integer
  - floating-point number
  - string
  - DBus object path

- Generate a format that `bmcweb` can use to create and populate a Redfish
  Message Registry, and translate from `phosphor-logging` to Redfish `LogEntry`
  for a set of errors and events.

For general users of `sdbusplus` these changes should have no impact, except for
the availability of new generated exception types and that specialized instances
of `sdbusplus::exception::generated_exception` will become available in DBus
clients.
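One way to picture the load-time registration facility is a name-to-factory table filled by static initializers. The sketch below is purely illustrative and is not the `sdbusplus` API: `registry`, `Registrar`, `fromDBusError`, and the stand-in `InvalidCertificate` type are all hypothetical, and the serialized metadata is carried as an opaque string rather than parsed JSON.

```cpp
#include <exception>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Factory that rebuilds a C++ exception from the serialized metadata carried
// in the DBus error message. All names in this sketch are hypothetical.
using Factory =
    std::function<std::unique_ptr<std::exception>(const std::string&)>;

// Function-local static so the table exists before any registrar runs.
std::map<std::string, Factory>& registry()
{
    static std::map<std::string, Factory> table;
    return table;
}

// A static instance of this type registers one exception type at load time.
struct Registrar
{
    Registrar(const std::string& name, Factory factory)
    {
        registry()[name] = std::move(factory);
    }
};

// Stand-in for a generated exception type.
struct InvalidCertificate : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

static const Registrar registerInvalidCertificate{
    "xyz.openbmc_project.Certs.Error.InvalidCertificate",
    [](const std::string& json) {
        return std::make_unique<InvalidCertificate>(json);
    }};

// Client side: convert a received DBus error back into a typed exception.
// Returns nullptr when the error name was never registered.
std::unique_ptr<std::exception> fromDBusError(const std::string& name,
                                              const std::string& message)
{
    const auto it = registry().find(name);
    return it == registry().end() ? nullptr : it->second(message);
}
```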

### `phosphor-dbus-interfaces`

Refactoring will be done to migrate existing `Foo.metadata.yaml` and
`Foo.errors.yaml` content to the `Foo.events.yaml` as migration is done by
applications. Minor changes will take place to utilize the new binding
generators from `sdbusplus`. A small library enhancement will be done to
register all generated exception types with `sdbusplus`. Future contributors
will be able to contribute new error and tracing event definitions.

### `phosphor-logging`

> TODO: Should a tracing event be a `Logging.Entry` with severity of
> `Informational` or should they be a new type, such as `Logging.Event` and
> managed separately? The `phosphor-logging` default `meson.options` have
> `error_cap=200` and `error_info_cap=10`. If we increase the total number of
> events allowed to 10K, the majority of them are likely going to be information
> / tracing events.

The `Logging.Entry` interface's `AdditionalData` property should change to
`dict[string, variant[string,int64_t,size_t,object_path]]`.

The `Logging.Create` interface will have a new method added:

```yaml
- name: CreateEntry
  parameters:
    - name: Message
      type: string
    - name: Severity
      type: enum[Logging.Entry.Level]
    - name: AdditionalData
      type: dict[string, variant[string,int64_t,size_t,object_path]]
    - name: Hint
      type: string
      default: ""
  returns:
    - name: Entry
      type: object_path
```

The `Hint` parameter is used for daemons to be able to query for their
previously recorded error, for marking as resolved. These strings need to be
globally unique and are suggested to be of the format `"<service_name>:<key>"`.

A `Logging.SearchHint` interface will be created, which will be recorded at the
same object path as a `Logging.Entry` when the `Hint` parameter was not an empty
string:

```yaml
- property: Hint
  type: string
```

The `Logging.Manager` interface will be added with a single method:

```yaml
- name: FindEntry
  parameters:
    - name: Hint
      type: string
  returns:
    - name: Entry
      type: object_path
  errors:
    - xyz.openbmc_project.Common.ResourceNotFound
```
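The intended interplay between `CreateEntry`'s `Hint` and `FindEntry` can be sketched with an in-memory index. This is a simplified model under stated assumptions, not the `phosphor-logging` implementation: `HintIndex`, `makeHint`, and the local `ResourceNotFound` type are hypothetical stand-ins, and persistence and DBus plumbing are omitted.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for xyz.openbmc_project.Common.ResourceNotFound (hypothetical).
struct ResourceNotFound : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Build a hint in the suggested "<service_name>:<key>" form.
std::string makeHint(const std::string& service, const std::string& key)
{
    return service + ":" + key;
}

class HintIndex
{
  public:
    // Called when an entry is created; empty hints are not indexed, matching
    // the rule that SearchHint exists only for non-empty hints.
    void record(const std::string& hint, const std::string& entryPath)
    {
        if (!hint.empty())
        {
            entries[hint] = entryPath;
        }
    }

    // Mirrors the proposed FindEntry method: returns the entry's object path
    // or throws ResourceNotFound.
    const std::string& findEntry(const std::string& hint) const
    {
        const auto it = entries.find(hint);
        if (it == entries.end())
        {
            throw ResourceNotFound("no entry for hint: " + hint);
        }
        return it->second;
    }

  private:
    std::map<std::string, std::string> entries;
};
```

A daemon would keep only the hint string, look up the entry when the fault clears, and then mark the returned object as resolved.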

A `lg2::commit` API will be added to support the new `sdbusplus` generated
exception types, calling the new `Logging.Create.CreateEntry` method proposed
earlier. This new API will support `sdbusplus::bus_t` for synchronous DBus
operations and both `sdbusplus::async::context_t` and
`sdbusplus::asio::connection` for asynchronous DBus operations.

There are outstanding performance concerns with the `phosphor-logging`
implementation that may impact the ability for scaling to 10,000 event records.
This issue is expected to be self-contained within `phosphor-logging`, except
for potential future changes to the log-retrieval interfaces used by `bmcweb`.
In order to decouple the transition to this design, by callers of the logging
APIs, from the experimentation and improvements in `phosphor-logging`, we will
add a compile option and Yocto `DISTRO_FEATURE` that can turn `lg2::commit`
behavior into an `OPENBMC_MESSAGE_ID` record in the journal, along the same
approach as the previous `REDFISH_MESSAGE_ID`, and corresponding `rsyslog`
configuration and `bmcweb` support to use these directly. This will allow
systems which knowingly scale to a large number of event records, using
`rsyslog` mechanics, the same level of performance. One caveat of this support
is that the hint and resolution behavior will not exist when that option is
enabled.

### `bmcweb`

`bmcweb` already has support for build-time conversion from a Redfish Message
Registry, codified in JSON, to header files it uses to serve the registry; this
will be expanded to support Redfish Message Registries generated by `sdbusplus`.
`bmcweb` will add a Meson option for additional message registries, provided
from bitbake from `phosphor-dbus-interfaces` and vendor-specific event
definitions as a path to a directory of Message Registry JSONs. Support will
also be added for adding `phosphor-dbus-interfaces` as a Meson subproject for
stand-alone testing.

It is desirable for `sdbusplus` to generate a Redfish Message Registry directly,
leveraging the existing scripts for integration with `bmcweb`. As part of this
we would like to support mapping a `Logging.Entry` event to an existing
standardized Redfish event (such as those in the Base registry). The generated
information must contain the `Logging.Entry::Message` identifier, the
`AdditionalData` to `MessageArgs` mapping, and the translation from the
`Message` identifier to the Redfish Message ID (when the Message ID is not from
"this" registry). In order to facilitate this, we will need to add OEM fields to
the Redfish Message Registry JSON, which are only used by the `bmcweb`
processing scripts, to generate the information necessary for this additional
mapping.

The `xyz.openbmc_project.Logging.Entry` to `LogEntry` conversion needs to be
enhanced, to utilize these Message Registries, in four ways:

1. A Base64-encoded JSON representation of the `Logging.Entry` will be assigned
   to the `DiagnosticData` property.

2. If the `Logging.Entry::Message` contains an identifier corresponding to a
   Registry entry, the `MessageId` property will be set to the corresponding
   Redfish Message ID. Otherwise, the `Logging.Entry::Message` will be used
   directly with no further transformation (as is done today).

3. If the `Logging.Entry::Message` contains an identifier corresponding to a
   Registry entry, the `MessageArgs` property will be filled in by obtaining the
   corresponding values from the `AdditionalData` dictionary and the `Message`
   field will be generated from combining these values with the `Message` string
   from the Registry.

4. A mechanism should be implemented to translate DBus `object_path` references
   to Redfish Resource URIs. When an `object_path` cannot be translated,
   `bmcweb` will use a prefix such as `object_path:` in the `MessageArgs` value.
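Steps 3 and 4 above can be sketched as a lookup from `AdditionalData` into an ordered argument list. This is an illustrative model only: `ObjectPath`, `Value`, `UriMap`, `toArg`, and `buildMessageArgs` are hypothetical names, the ordered list of argument names is assumed to come from the generated registry data, and the path-to-URI table stands in for whatever translation mechanism `bmcweb` ends up using.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Wrapper so DBus object paths are distinguishable from plain strings.
struct ObjectPath
{
    std::string path;
};

// In-memory stand-in for the proposed AdditionalData value type.
using Value = std::variant<std::string, std::int64_t, std::size_t, ObjectPath>;
using AdditionalData = std::map<std::string, Value>;

// Hypothetical translation table from DBus object paths to Redfish URIs.
using UriMap = std::map<std::string, std::string>;

// Render one metadata value as a MessageArgs string.
std::string toArg(const Value& value, const UriMap& uris)
{
    return std::visit(
        [&uris](const auto& x) -> std::string {
            using T = std::decay_t<decltype(x)>;
            if constexpr (std::is_same_v<T, std::string>)
            {
                return x;
            }
            else if constexpr (std::is_same_v<T, ObjectPath>)
            {
                const auto it = uris.find(x.path);
                // Untranslatable paths keep an explicit prefix (step 4).
                return it != uris.end() ? it->second
                                        : "object_path:" + x.path;
            }
            else
            {
                return std::to_string(x); // int64_t or size_t
            }
        },
        value);
}

// Build MessageArgs by looking up each registry argument name in order.
// A name missing from AdditionalData yields an empty string in this sketch.
std::vector<std::string> buildMessageArgs(
    const std::vector<std::string>& argNames, const AdditionalData& data,
    const UriMap& uris)
{
    std::vector<std::string> args;
    for (const auto& name : argNames)
    {
        const auto it = data.find(name);
        args.push_back(it != data.end() ? toArg(it->second, uris)
                                        : std::string{});
    }
    return args;
}
```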

The implementation of `EventService` should be enhanced to support
`phosphor-logging` hosted events. The implementation of `LogService` should be
enhanced to support log paging for `phosphor-logging` hosted events.

### `phosphor-sel-logger`

The `phosphor-sel-logger` has a meson option `send-to-logger` which toggles
between using `phosphor-logging` or the [`REDFISH_MESSAGE_ID`
mechanism][existing-design]. The `phosphor-logging`-utilizing paths will be
updated to utilize `phosphor-dbus-interfaces` specified errors and events.

### YAML format

Consider an example file in `phosphor-dbus-interfaces` as
`yaml/xyz/openbmc_project/Software/Update.events.yaml` with hypothetical errors
and events:

```yaml
version: 1.3.1

errors:
  - name: UpdateFailure
    severity: critical
    metadata:
      - name: TARGET
        type: string
        primary: true
      - name: ERRNO
        type: int64
      - name: CALLOUT_HARDWARE
        type: object_path
        primary: true
    en:
      description: While updating the firmware on a device, the update failed.
      message: A failure occurred updating {TARGET} on {CALLOUT_HARDWARE}.
      resolution: Retry update.

  - name: BMCUpdateFailure
    severity: critical
    deprecated: 1.0.0
    en:
      description: Failed to update the BMC
    redfish-mapping: OpenBMC.FirmwareUpdateFailed

events:
  - name: UpdateProgress
    metadata:
      - name: TARGET
        type: string
        primary: true
      - name: COMPLETION
        type: double
        primary: true
    en:
      description: An update is in progress and has reached a checkpoint.
      message: Updating of {TARGET} is {COMPLETION}% complete.
```

Each `foo.events.yaml` file would be used to generate both the C++ classes (via
`sdbusplus`) for exception handling and event reporting, as well as a versioned
Redfish Message Registry for the errors and events. The [YAML
schema][yaml-schema] is contained in the sdbusplus repository.

The above example YAML would generate C++ classes similar to:

```cpp
namespace sdbusplus::errors::xyz::openbmc_project::software::update
{

class UpdateFailure
{
  public:
    // Takes the metadata as alternating name/value arguments.
    template <typename... Args>
    UpdateFailure(Args&&... args);
};

} // namespace sdbusplus::errors::xyz::openbmc_project::software::update

namespace sdbusplus::events::xyz::openbmc_project::software::update
{

class UpdateProgress
{
  public:
    // Takes the metadata as alternating name/value arguments.
    template <typename... Args>
    UpdateProgress(Args&&... args);
};

} // namespace sdbusplus::events::xyz::openbmc_project::software::update
```

The constructors here are variadic templates because the generated constructor
implementation will provide compile-time assurance that all of the metadata
fields have been populated (in any order). To raise an `UpdateFailure` a
developer might do something like:

```cpp
// Immediately report the event:
lg2::commit(UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path));
// or send it in a dbus response (when using sdbusplus generated binding):
throw UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path);
```

If one of the fields, such as `ERRNO`, were omitted, a compile failure would be
raised indicating the first missing field.

[yaml-schema]:
  https://github.com/openbmc/sdbusplus/blob/master/tools/sdbusplus/schemas/events.schema.yaml

### Versioning Policy

Assume the version follows semantic versioning `MAJOR.MINOR.PATCH` convention.

- Adjusting a description or message should result in a `PATCH` increment.
- Adding a new error or event, or adding metadata to an existing error or event,
  should result in a `MINOR` increment.
- Deprecating an error or event should result in a `MAJOR` increment.
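The three rules above can be written down as a small function. This is a sketch for illustration: `Version`, `Change`, `bump`, and `toString` are hypothetical names, and resetting the lower fields to zero on a `MINOR` or `MAJOR` increment is the usual semantic-versioning convention, assumed here rather than stated by the policy.

```cpp
#include <string>

// Field names avoid "major"/"minor", which some C libraries define as macros.
struct Version
{
    unsigned majorPart = 0;
    unsigned minorPart = 0;
    unsigned patchPart = 0;
};

enum class Change
{
    TextOnly,    // description or message wording adjusted
    Addition,    // new error/event, or new metadata on an existing one
    Deprecation, // an error or event was deprecated
};

// Apply the versioning policy for a single change.
Version bump(Version v, Change change)
{
    switch (change)
    {
        case Change::TextOnly:
            ++v.patchPart;
            break;
        case Change::Addition:
            ++v.minorPart;
            v.patchPart = 0; // assumed semver convention
            break;
        case Change::Deprecation:
            ++v.majorPart;
            v.minorPart = 0; // assumed semver convention
            v.patchPart = 0;
            break;
    }
    return v;
}

std::string toString(const Version& v)
{
    return std::to_string(v.majorPart) + "." + std::to_string(v.minorPart) +
           "." + std::to_string(v.patchPart);
}
```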

There is [guidance on maintenance][registry-guidance] of the OpenBMC Message
Registry. We will incorporate that guidance into the equivalent
`phosphor-dbus-interfaces` policy.

[registry-guidance]:
  https://github.com/openbmc/bmcweb/blob/master/redfish-core/include/registries/openbmc_message_registry.readmefirst.md

### Generated Redfish Message Registry

[DSP0266][dsp0266], the Redfish specification, gives requirements for Redfish
Message Registries and dictates guidelines for identifiers.

The hypothetical events defined above would create a message registry similar
to:

```json
{
  "Id": "OpenBMC_Base_Xyz_OpenbmcProject_Software_Update.1.3.1",
  "Language": "en",
  "Messages": {
    "UpdateFailure": {
      "Description": "While updating the firmware on a device, the update failed.",
      "Message": "A failure occurred updating %1 on %2.",
      "Resolution": "Retry update.",
      "NumberOfArgs": 2,
      "ParamTypes": ["string", "string"],
      "Severity": "Critical"
    },
    "UpdateProgress": {
      "Description": "An update is in progress and has reached a checkpoint.",
      "Message": "Updating of %1 is %2% complete.",
      "Resolution": "None",
      "NumberOfArgs": 2,
      "ParamTypes": ["string", "number"],
      "Severity": "OK"
    }
  }
}
```
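The step from the YAML `message` (with `{NAME}` placeholders) to the registry `Message` (with `%N` placeholders) can be sketched as below. This is an assumption-laden illustration, not the `sdbusplus` generator: `toRegistryMessage` is a hypothetical name, and the numbering is assumed to follow the order in which `primary: true` metadata appears in the YAML, which is what the example registry suggests.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Rewrite "{NAME}" placeholders as "%N", where N is the 1-based position of
// NAME in primaryNames. Unknown names and unmatched '{' are left untouched.
std::string toRegistryMessage(const std::string& yamlMessage,
                              const std::vector<std::string>& primaryNames)
{
    std::string out;
    std::size_t pos = 0;
    while (pos < yamlMessage.size())
    {
        const std::size_t open = yamlMessage.find('{', pos);
        const std::size_t close = (open == std::string::npos)
                                      ? std::string::npos
                                      : yamlMessage.find('}', open);
        if (close == std::string::npos)
        {
            // No further complete placeholder: copy the remainder verbatim.
            out.append(yamlMessage, pos, std::string::npos);
            break;
        }
        out.append(yamlMessage, pos, open - pos);
        const std::string name =
            yamlMessage.substr(open + 1, close - open - 1);
        const auto it =
            std::find(primaryNames.begin(), primaryNames.end(), name);
        if (it != primaryNames.end())
        {
            out += '%';
            out += std::to_string(it - primaryNames.begin() + 1);
        }
        else
        {
            out.append(yamlMessage, open, close - open + 1);
        }
        pos = close + 1;
    }
    return out;
}
```

In the `UpdateFailure` example, `ERRNO` is not marked primary, so it contributes no message argument and `NumberOfArgs` is 2.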

The prefix `OpenBMC_Base` shall be exclusively reserved for use by events from
`phosphor-logging`. Events defined in other repositories will be expected to use
some other prefix. Vendor-defined repositories should use a vendor-owned prefix
as directed by [DSP0266][dsp0266].

[dsp0266]:
  https://www.dmtf.org/sites/default/files/standards/documents/DSP0266_1.20.0.pdf

### Vendor implications

As specified above, vendors must use their own identifiers in order to conform
with the Redfish specification (see [DSP0266][dsp0266] for requirements on
identifier naming). The `sdbusplus` (and `phosphor-logging` and `bmcweb`)
implementation(s) will enable vendors to create their own events for downstream
code and Registries for integration with Redfish, by creating downstream
repositories of error definitions. Vendors are responsible for ensuring their
own versioning and identifiers conform to the expectations in the [Redfish
specification][dsp0266].

One potential bad behavior on the part of vendors would be forking and modifying
`phosphor-dbus-interfaces` defined events. Vendors must not add their own events
to `phosphor-dbus-interfaces` in downstream implementations, because their
implementation would then advertise support for a message in an OpenBMC-owned
Registry that does not actually contain it. They should instead add such events
to their own repositories with a separate identifier. Similarly, if a vendor
were to _backport_ upstream changes into their fork, they would need to ensure
that the `foo.events.yaml` file for that version matches identically with the
upstream implementation.

## Alternatives Considered

Many alternatives have been explored and referenced through earlier work. Within
this proposal there are many minor alternatives that have been assessed.

### Exception inheritance

The original `phosphor-logging` error descriptions allowed inheritance between
two errors. This is not supported by the proposal for two reasons:

- This introduces complexity in the Redfish Message Registry versioning because
  a change in one file should induce version changes in all dependent files.

- It makes it difficult for a developer to clearly identify all of the fields
  they are expected to populate without traversing multiple files.

### sdbusplus Exception APIs

There are a few possible syntaxes I came up with for constructing the generated
exception types. It is important that these have good ergonomics, are easy to
understand, and can provide compile-time awareness of missing metadata fields.

```cpp
using Example = sdbusplus::error::xyz::openbmc_project::Example;

// 1)
throw Example().fru("Motherboard").value(42);

// 2)
throw Example(Example::fru_{}, "Motherboard", Example::value_{}, 42);

// 3)
throw Example("FRU", "Motherboard", "VALUE", 42);

// 4)
throw Example([](auto e) { return e.fru("Motherboard").value(42); });

// 5)
throw Example({.fru = "Motherboard", .value = 42});
```

**Note**: These examples are all shown using `throw` syntax, but could also be
saved in local variables, returned from functions, or immediately passed to
`lg2::commit`.

1. This would be my preference for ergonomics and clarity, as it would allow
   LSP-enabled editors to give completions for the metadata fields but
   unfortunately there is no mechanism in C++ to define a type which can be
   constructed but not thrown, which means we cannot get compile-time checking
   of all metadata fields.

2. This syntax uses tag-dispatch to enable compile-time checking of all
   metadata fields and potential LSP-completion of the tag-types, but is more
   verbose than option 3.

3. This syntax is less verbose than (2) and follows conventions already used in
   `phosphor-logging`'s `lg2` API, but does not allow LSP-completion of the
   metadata tags.

4. This syntax is similar to option (1) but uses an indirection of a lambda to
   enable compile-time checking that all metadata fields have been populated by
   the lambda. The LSP-completion is likely not as strong as option (1), due to
   the use of `auto`, and the lambda necessity will likely be a hang-up for
   unfamiliar developers.

5. This syntax has similar characteristics to option (1) but similarly does not
   provide compile-time confirmation that all fields have been populated.

The proposal therefore suggests option (3) is most suitable.
718
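To make the "compile-time checking of all metadata fields" criterion concrete,
the following is a minimal C++17 sketch of the tag-dispatch form described in
option (2). It only illustrates the criterion and is not the code that
`sdbusplus` generates: the `Example` class, its members, `makeExample`, and
`usage` are hypothetical stand-ins. The idea is that the only constructor
requires every tag/value pair, so a call site that omits or reorders a field
does not compile.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a generated event type (an assumption for
// illustration; the real generated classes may look different).
class Example
{
  public:
    // Empty tag types naming each metadata field.
    struct fru_
    {};
    struct value_
    {};

    // The only constructor takes every tag/value pair, in order.
    Example(fru_, std::string fruIn, value_, int valueIn) :
        fruName(std::move(fruIn)), reading(valueIn)
    {}

    const std::string& fru() const
    {
        return fruName;
    }
    int value() const
    {
        return reading;
    }

  private:
    std::string fruName;
    int reading;
};

// All fields supplied: accepted.
static_assert(std::is_constructible_v<Example, Example::fru_, const char*,
                                      Example::value_, int>);
// `value` omitted: rejected at compile time.
static_assert(!std::is_constructible_v<Example, Example::fru_, const char*>);
// Tags in the wrong order: also rejected at compile time.
static_assert(!std::is_constructible_v<Example, Example::value_, int,
                                       Example::fru_, const char*>);

// The object can be built and returned like any other value.
inline Example makeExample()
{
    return Example(Example::fru_{}, "Motherboard", Example::value_{}, 42);
}

// Usage: the accessors return what was passed at the call site.
inline void usage()
{
    Example e = makeExample();
    assert(e.fru() == "Motherboard");
    assert(e.value() == 42);
}
```

The `static_assert` lines are only there to demonstrate the rejection; in real
code the same errors would surface directly at the offending call site.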
### Redfish Translation Support

The proposed YAML format allows future addition of translation but it is not
enabled at this time. Future development could enable the Redfish Message
Registry to be generated in multiple languages if the `message:language` exists
for those languages.

### Redfish Registry Versioning

The Redfish Message Registries are required to be versioned and have three
numeric fields (i.e. `XX.YY.ZZ`), but only the first two are supposed to be used
in the Message ID. Rather than using the manually specified version, we could
take a few other approaches:

- Use a date code (e.g. `2024.17.x`) representing the ISO 8601 week when the
  registry was built.

  - This does not cover vendors that may choose to branch for stabilization
    purposes, so we can end up with two machines having the same
    OpenBMC-versioned message registry with different content.

- Use the most recent `openbmc/openbmc` tag as the version.

  - This does not cover vendors that build off HEAD and may deploy multiple
    images between two OpenBMC releases.

- Generate the version based on the git-history.

  - This requires `phosphor-dbus-interfaces` to be built from a git repository,
    which may not always be true for Yocto source mirrors, and requires
    non-trivial processing that continues to scale over time.

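The "first two fields" rule for the Message ID can be illustrated with a small
helper. This is a sketch only: `makeMessageId` is a hypothetical function, not
an existing `bmcweb` or `phosphor-logging` API, and it assumes the usual Redfish
`RegistryPrefix.Major.Minor.MessageKey` layout for a Message ID.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (an assumption for illustration).
// Builds "<RegistryPrefix>.<Major>.<Minor>.<MessageKey>" from a three-field
// registry version such as "0.4.0", dropping the third field.
inline std::string makeMessageId(const std::string& registryPrefix,
                                 const std::string& registryVersion,
                                 const std::string& messageKey)
{
    std::istringstream in(registryVersion);
    std::string major;
    std::string minor;

    // Read only the first two dot-separated fields of the version.
    std::getline(in, major, '.');
    std::getline(in, minor, '.');

    // A version without at least two fields is a caller error.
    assert(!major.empty() && !minor.empty());

    return registryPrefix + "." + major + "." + minor + "." + messageKey;
}
```

For a registry at version `0.4.0`, `makeMessageId("OpenBMC", "0.4.0",
"FanInserted")` would yield `OpenBMC.0.4.FanInserted`. Whichever versioning
approach is used, only the first two fields end up visible to clients through
the Message ID.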
### Existing OpenBMC Redfish Registry

There are currently 191 messages defined in the existing Redfish Message
Registry at version `OpenBMC.0.4.0`. Of those, not a single one in the codebase
is emitted with the correct version. 96 of those are only emitted by
Intel-specific code that is not pulled into any upstreamed machine, 39 are
emitted by potentially common code, and 56 are not even referenced in the
codebase outside of the bmcweb registry. Of the 39 common messages, half have an
equivalent in one of the standard registries that should be leveraged, and many
of the others do not have attributes that would facilitate a multi-host
configuration, so the registry at a minimum needs to be updated. No part of the
current implementation has the capability to handle Redfish Resource URIs.

The proposal therefore is to deprecate the existing registry and replace it with
the new generated registries. For repositories that currently emit events in the
existing format, we can maintain those call-sites for a time period of 1-2
years.

If this aspect of the proposal is rejected, the YAML format allows mapping from
`phosphor-dbus-interfaces` defined events to the current `OpenBMC.0.4.0`
registry `MessageIds`.

Potentially common:

- phosphor-post-code-manager
  - BIOSPOSTCode (unique)
- dbus-sensors
  - ChassisIntrusionDetected (unique)
  - ChassisIntrusionReset (unique)
  - FanInserted
  - FanRedundancyLost (unique)
  - FanRedundancyRegained (unique)
  - FanRemoved
  - LanLost
  - LanRegained
  - PowerSupplyConfigurationError (unique)
  - PowerSupplyConfigurationErrorRecovered (unique)
  - PowerSupplyFailed
  - PowerSupplyFailurePredicted (unique)
  - PowerSupplyFanFailed
  - PowerSupplyFanRecovered
  - PowerSupplyPowerLost
  - PowerSupplyPowerRestored
  - PowerSupplyPredictedFailureRecovered (unique)
  - PowerSupplyRecovered
- phosphor-sel-logger
  - IPMIWatchdog (unique)
  - `SensorThreshold*` : 8 different events
- phosphor-net-ipmid
  - InvalidLoginAttempted (unique)
- entity-manager
  - InventoryAdded (unique)
  - InventoryRemoved (unique)
- estoraged
  - ServiceStarted
- x86-power-control
  - NMIButtonPressed (unique)
  - NMIDiagnosticInterrupt (unique)
  - PowerButtonPressed (unique)
  - PowerRestorePolicyApplied (unique)
  - PowerSupplyPowerGoodFailed (unique)
  - ResetButtonPressed (unique)
  - SystemPowerGoodFailed (unique)

Intel-only implementations:

- intel-ipmi-oem
  - ADDDCCorrectable
  - BIOSPostERROR
  - BIOSRecoveryComplete
  - BIOSRecoveryStart
  - FirmwareUpdateCompleted
  - IntelUPILinkWidthReducedToHalf
  - IntelUPILinkWidthReducedToQuarter
  - LegacyPCIPERR
  - LegacyPCISERR
  - `ME*` : 29 different events
  - `Memory*` : 9 different events
  - MirroringRedundancyDegraded
  - MirroringRedundancyFull
  - `PCIeCorrectable*`, `PCIeFatal` : 29 different events
  - SELEntryAdded
  - SparingRedundancyDegraded
- pfr-manager
  - BIOSFirmwareRecoveryReason
  - BIOSFirmwarePanicReason
  - BMCFirmwarePanicReason
  - BMCFirmwareRecoveryReason
  - BMCFirmwareResiliencyError
  - CPLDFirmwarePanicReason
  - CPLDFirmwareResiliencyError
  - FirmwareResiliencyError
- host-error-monitor
  - CPUError
  - CPUMismatch
  - CPUThermalTrip
  - ComponentOverTemperature
  - SsbThermalTrip
  - VoltageRegulatorOverheated
- s2600wf-misc
  - DriveError
  - InventoryAdded

## Impacts

- New APIs are defined for error and event logging. This will deprecate existing
  `phosphor-logging` APIs, with a time to migrate, for error reporting.

- The design should improve performance by eliminating the regular parsing of
  the `systemd` journal. The design may decrease performance by allowing the
  number of error and event logs to be dramatically increased, which has an
  impact on file system utilization and potential DBus impacts on some services
  such as `ObjectMapper`.

- Backwards compatibility and documentation should be improved by the automatic
  generation of the Redfish Message Registry corresponding to all error and
  event reports.

### Organizational

- **Does this proposal require a new repository?**
  - No
- **Who will be the initial maintainer(s) of this repository?**
  - N/A
- **Which repositories are expected to be modified to execute this design?**
  - `sdbusplus`
  - `phosphor-dbus-interfaces`
  - `phosphor-logging`
  - `bmcweb`
  - Any repository creating an error or event.

## Testing

- Unit tests will be written in `sdbusplus` and `phosphor-logging` for the error
  and event generation, creation APIs, and to provide coverage on any changes to
  the `Logging.Entry` object management.

- Unit tests will be written for `bmcweb` for basic `Logging.Entry`
  transformation and Message Registry generation.

- Integration tests should be leveraged (and enhanced as necessary) from
  `openbmc-test-automation` to cover the end-to-end error creation and Redfish
  reporting.
