# Error and Event Logging

Author: [Patrick Williams][patrick-email] `<stwcx>`

[patrick-email]: mailto:patrick@stwcx.xyz

Other contributors:

Created: May 16, 2024

## Problem Description

There is currently no consistent end-to-end error and event reporting design
for the OpenBMC code stack. There are two different implementations, one
primarily using phosphor-logging and one using rsyslog, both of which have gaps
that a complete solution should address. This proposal is intended to be an
end-to-end design handling both errors and tracing events which facilitate
external management of the system in an automated and maintainable manner.

## Background and References

### Redfish LogEntry and Message Registry

In Redfish, the [`LogEntry` schema][LogEntry] is used for a range of items that
could be considered "logs", but one such use within OpenBMC is for an equivalent
of the IPMI "System Event Log (SEL)".

The IPMI SEL is the location where the BMC can collect errors and events,
sometimes coming from other entities, such as the BIOS. Examples of these might
be "DIMM-A0 encountered an uncorrectable ECC error" or "System boot successful".
These SEL records are exposed as human readable strings, either natively by an
OEM SEL design or by tools such as `ipmitool`. The strings are typically unique
to each system or manufacturer and could hypothetically change with a BMC or
firmware update, which makes it difficult to create automated tooling around
them. Two different vendors might use different strings to represent a critical
temperature threshold exceeded: ["temperature threshold exceeded"][HPE-Example]
and ["Temperature #0x30 Upper Critical going high"][Oracle-Example]. There is
also no mechanism with IPMI to ask the machine "what are all of the SELs you
might create".

In order to solve two aspects of this problem, listing of possible events and
versioning, Redfish has Message Registries. A message registry is a versioned
collection of all of the error events that a system could generate and hints as
to how they might be parsed and displayed to a user. An [informative
reference][Registry-Example] from the DMTF gives this example:

```json
{
  "@odata.type": "#MessageRegistry.v1_0_0.MessageRegistry",
  "Id": "Alert.1.0.0",
  "RegistryPrefix": "Alert",
  "RegistryVersion": "1.0.0",
  "Messages": {
    "LanDisconnect": {
      "Description": "A LAN Disconnect on %1 was detected on system %2.",
      "Message": "A LAN Disconnect on %1 was detected on system %2.",
      "Severity": "Warning",
      "NumberOfArgs": 2,
      "Resolution": "None"
    }
  }
}
```

This example defines an event, `Alert.1.0.LanDisconnect`, which can record the
disconnect state of a network device and contains placeholders for the affected
device and system. When this event occurs, there might be a `LogEntry` recorded
containing something like:

```json
{
  "Message": "A LAN Disconnect on EthernetInterface 1 was detected on system /redfish/v1/Systems/1.",
  "MessageId": "Alert.1.0.LanDisconnect",
  "MessageArgs": ["EthernetInterface 1", "/redfish/v1/Systems/1"]
}
```

The `Message` contains a human readable string which was created by applying the
`MessageArgs` to the placeholders from the `Message` field in the registry.
System management software can rely on the message registry (referenced from the
`MessageId` field in the `LogEntry`) and `MessageArgs` to avoid needing to
perform string processing for reacting to the event.
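
As a minimal sketch of that substitution step, the following function applies
`MessageArgs` to the `%1`, `%2`, ... placeholders of a registry `Message`
template. The function name `formatMessage` is hypothetical and is not part of
`bmcweb` or any Redfish library; leaving unmatched placeholders untouched is
also an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: replace each 1-based %N placeholder in a registry
// Message template with the matching MessageArgs entry. Placeholders that
// have no matching argument are copied through unchanged.
std::string formatMessage(const std::string& tmpl,
                          const std::vector<std::string>& args)
{
    std::string out;
    for (std::size_t i = 0; i < tmpl.size(); ++i)
    {
        if (tmpl[i] == '%' && i + 1 < tmpl.size() && tmpl[i + 1] >= '1' &&
            tmpl[i + 1] <= '9')
        {
            // Parse the (possibly multi-digit) argument index.
            std::size_t j = i + 1;
            std::size_t index = 0;
            while (j < tmpl.size() && tmpl[j] >= '0' && tmpl[j] <= '9')
            {
                index = index * 10 + static_cast<std::size_t>(tmpl[j] - '0');
                ++j;
            }
            if (index <= args.size())
            {
                out += args[index - 1];
                i = j - 1; // skip past the digits that were consumed
                continue;
            }
        }
        out += tmpl[i];
    }
    return out;
}
```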

Within OpenBMC, there is currently a [limited design][existing-design] for this
Redfish feature and it requires inserting specially formed Redfish-specific
logging messages into any application that wants to record these events, tightly
coupling all applications to the Redfish implementation. It has also been
observed that these [strings][app-example], when used, are often out of date
with the [message registry][bmcweb-registry] advertised by `bmcweb`. Some
maintainers have rejected adding new Redfish-specific logging messages to their
applications.

[LogEntry]:
  https://github.com/openbmc/bmcweb/blob/de0c960c4262169ea92a4b852dd5ebbe3810bf00/redfish-core/schema/dmtf/json-schema/LogEntry.v1_16_0.json
[HPE-Example]:
  https://support.hpe.com/hpesc/public/docDisplay?docId=sd00002092en_us&docLocale=en_US&page=GUID-D7147C7F-2016-0901-06CE-000000000422.html
[Oracle-Example]:
  https://docs.oracle.com/cd/E19464-01/820-6850-11/IPMItool.html#50602039_63068
[Registry-Example]:
  https://www.dmtf.org/sites/default/files/Redfish%20School%20-%20Events_0.pdf
[existing-design]:
  https://github.com/openbmc/docs/blob/master/architecture/redfish-logging-in-bmcweb.md
[app-example]:
  https://github.com/openbmc/phosphor-post-code-manager/blob/f2da78deb3a105c7270f74d9d747c77f0feaae2c/src/post_code.cpp#L143
[bmcweb-registry]:
  https://github.com/openbmc/bmcweb/blob/4ba5be51e3fcbeed49a6a312b4e6b2f1ea7447ba/redfish-core/include/registries/openbmc.json#L5

### Existing phosphor-logging implementation

**Note**: While the word 'exception' is used in this section, the existing (and
proposed) types can be used by applications and execution contexts with
exceptions disabled. They are 'exceptions' because they do inherit from
`std::exception` and there is support in the `sdbusplus` bindings for them to be
used in exception handling.

The `sdbusplus` bindings have the capability to define new C++ exception types
which can be thrown by a DBus server and turned into an error response to the
client. `phosphor-logging` extended this to also add metadata associated with
the log type. See the following example error definitions and usages.

`sdbusplus` error binding definition (in
`xyz/openbmc_project/Certs.errors.yaml`):

```yaml
- name: InvalidCertificate
  description: Invalid certificate file.
```

`phosphor-logging` metadata definition (in
`xyz/openbmc_project/Certs.metadata.yaml`):

```yaml
- name: InvalidCertificate
  meta:
    - str: "REASON=%s"
      type: string
```

Application code reporting an error:

```cpp
elog<InvalidCertificate>(Reason("Invalid certificate file format"));
// or
report<InvalidCertificate>(Reason("Existing certificate file is corrupted"));
```

In this sample, an error named
`xyz.openbmc_project.Certs.Error.InvalidCertificate` has been defined, which can
be sent between applications as a DBus response. The `InvalidCertificate` is
expected to have additional metadata `REASON` which is a string. The two APIs
`elog` and `report` have slightly different behaviors: `elog` throws an
exception which can either result in an error DBus result or be handled
elsewhere in the application, while `report` sends the event directly to
`phosphor-logging`'s daemon for recording. As a side-effect of both calls, the
metadata is inserted into the `systemd` journal.

When an error is sent to the `phosphor-logging` daemon, it will:

1. Search back through the journal for recorded metadata associated with the
   event (this is a relatively slow operation).
2. Create an [`xyz.openbmc_project.Logging.Entry`][Logging-Entry] DBus object
   with the associated data extracted from the journal.
3. Persist a serialized version of the object.

Within `bmcweb` there is support for translating
`xyz.openbmc_project.Logging.Entry` objects advertised by `phosphor-logging`
into Redfish `LogEntries`, but this support does not reference a Message
Registry. This makes the events of limited utility for consumption by system
management software, as it cannot know all of the event types and is left to
perform (hand-coded) regular-expressions to extract any information from the
`Message` field of the `LogEntry`. Furthermore, these regular-expressions are
likely to become outdated over time as internal OpenBMC error reporting
structure, metadata, or message strings evolve.
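
For contrast, when a `LogEntry` does carry a registry-backed `MessageId`, a
client can split the identifier into its parts and dispatch on them instead of
matching message text. The sketch below assumes the
`<RegistryPrefix>.<Major>.<Minor>.<MessageKey>` shape seen in the earlier
`Alert.1.0.LanDisconnect` example; the `MessageId` struct and `parseMessageId`
function are hypothetical names used only for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical decomposition of a Redfish MessageId such as
// "Alert.1.0.LanDisconnect".
struct MessageId
{
    std::string registryPrefix;
    unsigned long versionMajor;
    unsigned long versionMinor;
    std::string messageKey;
};

// True for a non-empty, short, all-digit string (safe for std::stoul).
static bool isSmallNumber(const std::string& s)
{
    if (s.empty() || s.size() > 9)
    {
        return false;
    }
    for (unsigned char c : s)
    {
        if (!std::isdigit(c))
        {
            return false;
        }
    }
    return true;
}

// Returns std::nullopt when the text is not of the expected four-part form,
// for example when it is a free-form message string.
std::optional<MessageId> parseMessageId(const std::string& id)
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true)
    {
        const std::size_t dot = id.find('.', start);
        if (dot == std::string::npos)
        {
            parts.push_back(id.substr(start));
            break;
        }
        parts.push_back(id.substr(start, dot - start));
        start = dot + 1;
    }
    if (parts.size() != 4 || parts[0].empty() || parts[3].empty() ||
        !isSmallNumber(parts[1]) || !isSmallNumber(parts[2]))
    {
        return std::nullopt;
    }
    return MessageId{parts[0], std::stoul(parts[1]), std::stoul(parts[2]),
                     parts[3]};
}
```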

[Logging-Entry]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/9012243e543abdc5851b7e878c17c991b2a2a8b7/yaml/xyz/openbmc_project/Logging/Entry.interface.yaml#L1

### Issues with the Status Quo

- There are two different implementations of error logging, neither of which is
  both complete and fully accepted by maintainers. These implementations also do
  not cover tracing events.

- The `REDFISH_MESSAGE_ID` log approach leads to differences between the Redfish
  Message Registry and the reporting application. It also requires every
  application to be "Redfish aware" which limits decoupling between applications
  and external management interfaces. This also leaves gaps for reporting errors
  in different management interfaces, such as inband IPMI and PLDM. The approach
  also does not provide compile-time assurance of appropriate metadata
  collection, which can lead to producing code being out-of-date with the
  message registry definitions.

- The `phosphor-logging` approach does not provide compile-time assurance of
  appropriate metadata collection and requires expensive daemon processing of
  the `systemd` journal on each error report, which limits scalability.

- The `sdbusplus` bindings for error reporting do not currently handle lossless
  transmission of errors between DBus servers and clients.

- Similar applications can result in different Redfish `LogEntry` records for
  the same error scenario. This has been observed in sensor threshold exceeded
  events between `dbus-sensors`, `phosphor-hwmon`, `phosphor-virtual-sensor`,
  and `phosphor-health-monitor`. One cause of this is two different error
  reporting approaches and disagreements amongst maintainers as to the preferred
  approach.

## Requirements

- Applications running on the BMC must be able to report errors and failures
  which are persisted and available for external system management through
  standards such as Redfish.

  - These errors must be structured and versioned, and the complete set of
    errors able to be created by the BMC should be available at build time of a
    BMC image.
  - The set of errors, able to be created by the BMC, must be able to be
    transformed into relevant data sets, such as Redfish Message Registries.
    - For Redfish, the transformation must comply with the Redfish standard
      requirements, such as conforming to semantic versioning expectations.
    - For Redfish, the transformation should allow mapping internally defined
      events to pre-existing Redfish Message Registries for broader
      compatibility.
    - For Redfish, the implementation must also support the EventService
      mechanics for push-reporting.
  - Errors reported by the BMC should contain sufficient information to allow
    service of the system for these failures, either by humans or automation
    (depending on the individual system requirements).

- Applications running on the BMC should be able to report important tracing
  events relevant to system management and/or debug, such as the system
  successfully reaching a running state.

  - All requirements relevant to errors are also applicable to tracing events.
  - The implementation must have a mechanism for vendors to be able to disable
    specific tracing events to conform to their own system design requirements.

- Applications running on the BMC should be able to determine when a previously
  reported error is no longer relevant and mark it as "resolved", while
  maintaining the persistent record for future usages such as debug.

- The BMC should provide a mechanism for managed entities within the server to
  report their own errors and events. Examples of managed entities would be
  firmware, such as the BIOS, and satellite management controllers.

- The implementation on the BMC should scale to a minimum of
  [10,000][error-discussion] errors and events without impacting the BMC or
  managed system performance.

- The implementation should provide a mechanism to allow OEM or vendor
  extensions to the error and event definitions (and generated artifacts such as
  the Redfish Message Registry) for usage in closed-source or non-upstreamed
  code. These extensions must be clearly identified, in all interfaces, as
  vendor-specific and not be tied to the OpenBMC project.

- APIs to implement error and event reporting should have good ergonomics. These
  APIs must provide compile-time identification, for applicable programming
  languages, of call sites which do not conform to the BMC error and event
  specifications.

  - The generated error classes and APIs should not require exceptions but
    should also integrate with the `sdbusplus` client and server bindings, which
    do leverage exceptions.

[error-discussion]:
  https://discord.com/channels/775381525260664832/855566794994221117/867794201897992213

## Proposed Design

The proposed design has a few high-level design elements:

- Consolidate the `sdbusplus` and `phosphor-logging` implementation of error
  reporting; expand it to cover tracing events; improve the ergonomics of the
  associated APIs and add compile-time checking of missing metadata.

- Add APIs to `phosphor-logging` to enable daemons to easily look up their own
  previously reported events (for marking as resolved).

- Add to `phosphor-logging` a compile-time mechanism to disable recording of
  specific tracing events for vendor-level customization.

- Generate a Redfish Message Registry for all errors and events defined in
  `phosphor-dbus-interfaces`, using binding generators from `sdbusplus`. Enhance
  the `bmcweb` implementation of the `Logging.Entry` to `LogEntry`
  transformation to cover the Redfish Message Registry and `phosphor-logging`
  enhancements. Leverage the Redfish `LogEntry.DiagnosticData` field to provide
  a Base64-encoded JSON representation of the entire `Logging.Entry` for
  additional diagnostics [[does this need to be optional?]]. Add support to the
  `bmcweb` EventService implementation to support `phosphor-logging`-hosted
  events.
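
To make the `DiagnosticData` idea concrete, here is a minimal sketch of
standard Base64 encoding (RFC 4648 alphabet, with `=` padding) that could be
applied to an already serialized JSON string. The function name `base64Encode`
is hypothetical; a real implementation would more likely reuse an existing
Base64 utility rather than add a new one.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: Base64-encode arbitrary bytes held in a std::string,
// for example the JSON serialization of a Logging.Entry.
std::string base64Encode(const std::string& input)
{
    static constexpr char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Read one input byte as an unsigned 32-bit value.
    auto byteAt = [&input](std::size_t i) {
        return static_cast<std::uint32_t>(static_cast<unsigned char>(input[i]));
    };

    std::string out;
    out.reserve(((input.size() + 2) / 3) * 4);

    std::size_t i = 0;
    // Encode every complete 3-byte group as 4 output characters.
    for (; i + 2 < input.size(); i += 3)
    {
        const std::uint32_t n =
            (byteAt(i) << 16) | (byteAt(i + 1) << 8) | byteAt(i + 2);
        out += alphabet[(n >> 18) & 0x3F];
        out += alphabet[(n >> 12) & 0x3F];
        out += alphabet[(n >> 6) & 0x3F];
        out += alphabet[n & 0x3F];
    }

    // Encode the 1 or 2 leftover bytes and pad the result with '='.
    const std::size_t remaining = input.size() - i;
    if (remaining == 1)
    {
        const std::uint32_t n = byteAt(i) << 16;
        out += alphabet[(n >> 18) & 0x3F];
        out += alphabet[(n >> 12) & 0x3F];
        out += "==";
    }
    else if (remaining == 2)
    {
        const std::uint32_t n = (byteAt(i) << 16) | (byteAt(i + 1) << 8);
        out += alphabet[(n >> 18) & 0x3F];
        out += alphabet[(n >> 12) & 0x3F];
        out += alphabet[(n >> 6) & 0x3F];
        out += '=';
    }
    return out;
}
```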

### `sdbusplus`

The `Foo.errors.yaml` content will be combined with the content formerly in the
`Foo.metadata.yaml` files specified by `phosphor-logging` and expressed in a new
file type, `Foo.events.yaml`. This `Foo.events.yaml` format will cover both the
current `error` and `metadata` information as well as augment with additional
information necessary to generate external facing datasets, such as Redfish
Message Registries. The current `Foo.errors.yaml` and `Foo.metadata.yaml` files
will be deprecated as their usage is replaced by the new format.

The `sdbusplus` library will be enhanced to provide the following:

- JSON serialization and de-serialization of generated exception types with
  their assigned metadata; assignment of the JSON serialization to the `message`
  field of `sd_bus_error_set` calls when errors are returned from DBus server
  calls.

- A facility to register exception types, at library load time, with the
  `sdbusplus` library for automatic conversion back to C++ exception types in
  DBus clients.

The binding generator(s) will be expanded to do the following:

- Generate complete C++ exception types, with compile-time checking of missing
  metadata and JSON serialization, for errors and events. Metadata can be of one
  of the following types:

  - size-type and signed integer
  - floating-point number
  - string
  - DBus object path

- Generate a format that `bmcweb` can use to create and populate a Redfish
  Message Registry, and translate from `phosphor-logging` to Redfish `LogEntry`
  for a set of errors and events.

For general users of `sdbusplus` these changes should have no impact, except for
the availability of new generated exception types and that specialized instances
of `sdbusplus::exception::generated_exception` will become available in DBus
clients.

### `phosphor-dbus-interfaces`

Refactoring will be done to migrate existing `Foo.metadata.yaml` and
`Foo.errors.yaml` content to the `Foo.events.yaml` as migration is done by
applications. Minor changes will take place to utilize the new binding
generators from `sdbusplus`. A small library enhancement will be done to
register all generated exception types with `sdbusplus`. Future contributors
will be able to contribute new error and tracing event definitions.

### `phosphor-logging`

> TODO: Should a tracing event be a `Logging.Entry` with severity of
> `Informational` or should they be a new type, such as `Logging.Event` and
> managed separately? The `phosphor-logging` default `meson.options` have
> `error_cap=200` and `error_info_cap=10`. If we increase the total number of
> events allowed to 10K, the majority of them are likely going to be information
> / tracing events.

The `Logging.Entry` interface's `AdditionalData` property should change to
`dict[string, variant[string,int64_t,size_t,object_path]]`.

The `Logging.Create` interface will have a new method added:

```yaml
- name: CreateEntry
  parameters:
    - name: Message
      type: string
    - name: Severity
      type: enum[Logging.Entry.Level]
    - name: AdditionalData
      type: dict[string, variant[string,int64_t,size_t,object_path]]
    - name: Hint
      type: string
      default: ""
  returns:
    - name: Entry
      type: object_path
```

The `Hint` parameter is used for daemons to be able to query for their
previously recorded error, for marking as resolved. These strings need to be
globally unique and are suggested to be of the format `"<service_name>:<key>"`.

A `Logging.SearchHint` interface will be created, which will be recorded at the
same object path as a `Logging.Entry` when the `Hint` parameter was not an empty
string:

```yaml
- property: Hint
  type: string
```

The `Logging.Manager` interface will be added with a single method:

```yaml
- name: FindEntry
  parameters:
    - name: Hint
      type: string
  returns:
    - name: Entry
      type: object_path
  errors:
    - xyz.openbmc_project.Common.ResourceNotFound
```
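
The hint bookkeeping described above can be pictured with a small in-memory
sketch. Everything here is hypothetical: `makeHint` and `HintIndex` are not
part of `phosphor-logging`, the object paths are illustrative, and the choice
that a later entry with the same hint replaces the earlier one is an assumption
the design does not state. A `std::nullopt` result stands in for the
`ResourceNotFound` error of `FindEntry`.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Build a hint in the suggested "<service_name>:<key>" format.
std::string makeHint(const std::string& serviceName, const std::string& key)
{
    return serviceName + ":" + key;
}

// Hypothetical index from Hint strings to Logging.Entry object paths.
class HintIndex
{
  public:
    // Record the entry created for a hint. Empty hints are not indexed,
    // mirroring "when the Hint parameter was not an empty string".
    void add(const std::string& hint, const std::string& entryPath)
    {
        if (!hint.empty())
        {
            // Assumption: a newer entry with the same hint replaces the old.
            entries_[hint] = entryPath;
        }
    }

    // Look up the entry for a hint; std::nullopt plays the role of the
    // ResourceNotFound error.
    std::optional<std::string> find(const std::string& hint) const
    {
        const auto it = entries_.find(hint);
        if (it == entries_.end())
        {
            return std::nullopt;
        }
        return it->second;
    }

  private:
    std::unordered_map<std::string, std::string> entries_;
};
```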

A `lg2::commit` API will be added to support the new `sdbusplus` generated
exception types, calling the new `Logging.Create.CreateEntry` method proposed
earlier. This new API will support `sdbusplus::bus_t` for synchronous DBus
operations and both `sdbusplus::async::context_t` and
`sdbusplus::asio::connection` for asynchronous DBus operations.

There are outstanding performance concerns with the `phosphor-logging`
implementation that may impact the ability to scale to 10,000 event records.
This issue is expected to be self-contained within `phosphor-logging`, except
for potential future changes to the log-retrieval interfaces used by `bmcweb`.
In order to decouple the transition to this design, by callers of the logging
APIs, from the experimentation and improvements in `phosphor-logging`, we will
add a compile option and Yocto `DISTRO_FEATURE` that can turn `lg2::commit`
behavior into an `OPENBMC_MESSAGE_ID` record in the journal, following the same
approach as the previous `REDFISH_MESSAGE_ID`. Corresponding `rsyslog`
configuration and `bmcweb` support will be added to use these records directly.
This will allow systems which knowingly scale to a large number of event
records, using `rsyslog` mechanics, the same level of performance. One caveat of
this support is that the hint and resolution behavior will not exist when that
option is enabled.

### `bmcweb`

`bmcweb` already has support for build-time conversion from a Redfish Message
Registry, codified in JSON, to header files it uses to serve the registry; this
will be expanded to support Redfish Message Registries generated by `sdbusplus`.
`bmcweb` will add a Meson option for additional message registries, provided
from bitbake from `phosphor-dbus-interfaces` and vendor-specific event
definitions as a path to a directory of Message Registry JSONs. Support will
also be added for adding `phosphor-dbus-interfaces` as a Meson subproject for
stand-alone testing.

It is desirable for `sdbusplus` to generate a Redfish Message Registry directly,
leveraging the existing scripts for integration with `bmcweb`. As part of this
we would like to support mapping a `Logging.Entry` event to an existing
standardized Redfish event (such as those in the Base registry). The generated
information must contain the `Logging.Entry::Message` identifier, the
`AdditionalData` to `MessageArgs` mapping, and the translation from the
`Message` identifier to the Redfish Message ID (when the Message ID is not from
"this" registry). In order to facilitate this, we will need to add OEM fields to
the Redfish Message Registry JSON, which are only used by the `bmcweb`
processing scripts, to generate the information necessary for this additional
mapping.

The `xyz.openbmc_project.Logging.Entry` to `LogEntry` conversion needs to be
enhanced, to utilize these Message Registries, in four ways:

1. A Base64-encoded JSON representation of the `Logging.Entry` will be assigned
   to the `DiagnosticData` property.

2. If the `Logging.Entry::Message` contains an identifier corresponding to a
   Registry entry, the `MessageId` property will be set to the corresponding
   Redfish Message ID. Otherwise, the `Logging.Entry::Message` will be used
   directly with no further transformation (as is done today).

3. If the `Logging.Entry::Message` contains an identifier corresponding to a
   Registry entry, the `MessageArgs` property will be filled in by obtaining the
   corresponding values from the `AdditionalData` dictionary and the `Message`
   field will be generated from combining these values with the `Message` string
   from the Registry.

4. A mechanism should be implemented to translate DBus `object_path` references
   to Redfish Resource URIs. When an `object_path` cannot be translated,
   `bmcweb` will use a prefix such as `object_path:` in the `MessageArgs` value.
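
Steps 3 and 4 above can be sketched as a single function that walks the
registry's ordered list of argument names, pulls each value from
`AdditionalData`, and falls back to an `object_path:` prefix when a path cannot
be translated. All names here (`ObjectPath`, `MetadataValue`, `UriTranslator`,
`buildMessageArgs`) are hypothetical and do not correspond to existing `bmcweb`
code; using an empty string for a missing key is also an assumption.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-in for a DBus object_path value.
struct ObjectPath
{
    std::string path;
};

// Mirrors the proposed variant[string,int64_t,size_t,object_path].
using MetadataValue =
    std::variant<std::string, std::int64_t, std::uint64_t, ObjectPath>;
using AdditionalData = std::map<std::string, MetadataValue>;

// Hypothetical translator: returns a Redfish URI, or nullopt if unknown.
using UriTranslator =
    std::function<std::optional<std::string>(const std::string&)>;

std::vector<std::string>
    buildMessageArgs(const std::vector<std::string>& argNames,
                     const AdditionalData& data, const UriTranslator& toUri)
{
    std::vector<std::string> args;
    for (const auto& name : argNames)
    {
        const auto it = data.find(name);
        if (it == data.end())
        {
            args.emplace_back(); // assumption: missing values become ""
            continue;
        }
        args.push_back(std::visit(
            [&toUri](const auto& value) -> std::string {
                using T = std::decay_t<decltype(value)>;
                if constexpr (std::is_same_v<T, std::string>)
                {
                    return value;
                }
                else if constexpr (std::is_same_v<T, ObjectPath>)
                {
                    if (auto uri = toUri(value.path))
                    {
                        return *uri;
                    }
                    // Untranslatable paths keep a recognizable prefix.
                    return "object_path:" + value.path;
                }
                else
                {
                    return std::to_string(value);
                }
            },
            it->second));
    }
    return args;
}
```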

The implementation of `EventService` should be enhanced to support
`phosphor-logging` hosted events. The implementation of `LogService` should be
enhanced to support log paging for `phosphor-logging` hosted events.

### `phosphor-sel-logger`

The `phosphor-sel-logger` has a meson option `send-to-logger` which toggles
between using `phosphor-logging` or the [`REDFISH_MESSAGE_ID`
mechanism][existing-design]. The `phosphor-logging`-utilizing paths will be
updated to utilize `phosphor-dbus-interfaces` specified errors and events.

### YAML format

Consider an example file in `phosphor-dbus-interfaces` as
`yaml/xyz/openbmc_project/Software/Update.events.yaml` with hypothetical errors
and events:

```yaml
version: 1.3.1

errors:
  - name: UpdateFailure
    severity: critical
    metadata:
      - name: TARGET
        type: string
        primary: true
      - name: ERRNO
        type: int64
      - name: CALLOUT_HARDWARE
        type: object_path
        primary: true
    en:
      description: While updating the firmware on a device, the update failed.
      message: A failure occurred updating {TARGET} on {CALLOUT_HARDWARE}.
      resolution: Retry update.

  - name: BMCUpdateFailure
    severity: critical
    deprecated: 1.0.0
    en:
      description: Failed to update the BMC
    redfish-mapping: OpenBMC.FirmwareUpdateFailed

events:
  - name: UpdateProgress
    metadata:
      - name: TARGET
        type: string
        primary: true
      - name: COMPLETION
        type: double
        primary: true
    en:
      description: An update is in progress and has reached a checkpoint.
      message: Updating of {TARGET} is {COMPLETION}% complete.
```

Each `Foo.events.yaml` file would be used to generate both the C++ classes (via
`sdbusplus`) for exception handling and event reporting, as well as a versioned
Redfish Message Registry for the errors and events. The YAML schema is as
follows:

```yaml
$id: https://openbmc-project.xyz/sdbusplus/events.schema.yaml
$schema: https://json-schema.org/draft/2020-12/schema
title: Event and error definitions
type: object
$defs:
  event:
    type: array
    items:
      type: object
      properties:
        name:
          type: string
          description:
            An identifier for the event in UpperCamelCase; used as the class and
            Redfish Message ID.
        en:
          type: object
          description: The details for English.
          properties:
            description:
              type: string
              description:
                A developer-applicable description of the error reported. This
                forms the "description" of the Redfish message.
            message:
              type: string
              description:
                The end-user message, including placeholders for arguments.
            resolution:
              type: string
              description: The end-user resolution.
        severity:
          enum:
            - emergency
            - alert
            - critical
            - error
            - warning
            - notice
            - informational
            - debug
          description:
            The `xyz.openbmc_project.Logging.Entry.Level` value for this
            error.  Only applicable for 'errors'.
        redfish-mapping:
          type: string
          description:
            Used when a `sdbusplus` event should map to a specific Redfish
            Message rather than a generated one. This is useful when an internal
            error has an analog in a standardized registry.
        deprecated:
          type: string
          pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$"
          description:
            Indicates that the event is now deprecated and should not be created
            by any OpenBMC software, but is required to still exist for
            generation in the Redfish Message Registry. The version listed here
            should be the first version where the error is no longer used.
        metadata:
          type: array
          items:
            type: object
            properties:
              name:
                type: string
                description: The name of the metadata field.
              type:
                enum:
                  - string
                  - size
                  - int64
                  - uint64
                  - double
                  - object_path
                description: The type of the metadata field.
              primary:
                type: boolean
                description:
                  Set to true when the metadata field is expected to be part of
                  the Redfish `MessageArgs` (and not only in the extended
                  `DiagnosticData`).
properties:
  version:
    type: string
    pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$"
    description:
      The version of the file, which will be used as the Redfish Message
      Registry version.
  errors:
    $ref: "#/$defs/event"
  events:
    $ref: "#/$defs/event"
```

The above example YAML would generate C++ classes similar to:

```cpp
namespace sdbusplus::errors::xyz::openbmc_project::software::update
{

class UpdateFailure
{
  public:
    // Accepts the metadata fields as name/value pairs, in any order.
    template <typename... Args>
    UpdateFailure(Args&&... args);
};

} // namespace sdbusplus::errors::xyz::openbmc_project::software::update

namespace sdbusplus::events::xyz::openbmc_project::software::update
{

class UpdateProgress
{
  public:
    // Accepts the metadata fields as name/value pairs, in any order.
    template <typename... Args>
    UpdateProgress(Args&&... args);
};

} // namespace sdbusplus::events::xyz::openbmc_project::software::update
```

The constructors here are variadic templates because the generated constructor
implementation will provide compile-time assurance that all of the metadata
fields have been populated (in any order). To raise an `UpdateFailure` a
developer might do something like:

```cpp
// Immediately report the event:
lg2::commit(UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path));
// or send it in a dbus response (when using sdbusplus generated binding):
throw UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path);
```

If one of the fields, such as `ERRNO`, were omitted, a compile failure would be
raised indicating the first missing field.
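
One way such a compile-time check could work is sketched below. This is not
the generated `sdbusplus` code: it swaps the string-keyed arguments shown above
for hypothetical tag types (`Target`, `Errno`, `CalloutHardware`), because tag
types let a plain `static_assert` detect a missing field in standard C++17.
The class layout and accessor names are likewise assumptions made for the
example.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical tag types, one per metadata field of UpdateFailure.
struct Target
{
    std::string value;
};
struct Errno
{
    std::int64_t value;
};
struct CalloutHardware
{
    std::string value;
};

// True when T appears in Args (ignoring references and cv-qualifiers).
template <typename T, typename... Args>
inline constexpr bool contains_v =
    (std::is_same_v<T, std::decay_t<Args>> || ...);

class UpdateFailure
{
  public:
    // Fields may be passed in any order; omitting one fails to compile.
    template <typename... Args>
    explicit UpdateFailure(Args&&... args)
    {
        static_assert(contains_v<Target, Args...>, "missing TARGET");
        static_assert(contains_v<Errno, Args...>, "missing ERRNO");
        static_assert(contains_v<CalloutHardware, Args...>,
                      "missing CALLOUT_HARDWARE");
        static_assert(sizeof...(Args) == 3, "unexpected or repeated field");
        (set(std::forward<Args>(args)), ...);
    }

    const std::string& target() const
    {
        return target_;
    }
    std::int64_t errnoValue() const
    {
        return errnoValue_;
    }
    const std::string& calloutHardware() const
    {
        return calloutHardware_;
    }

  private:
    void set(Target t)
    {
        target_ = std::move(t.value);
    }
    void set(Errno e)
    {
        errnoValue_ = e.value;
    }
    void set(CalloutHardware c)
    {
        calloutHardware_ = std::move(c.value);
    }

    std::string target_;
    std::int64_t errnoValue_ = 0;
    std::string calloutHardware_;
};
```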

### Versioning Policy

Assume the version follows semantic versioning `MAJOR.MINOR.PATCH` convention.

- Adjusting a description or message should result in a `PATCH` increment.
- Adding a new error or event, or adding metadata to an existing error or event,
  should result in a `MINOR` increment.
- Deprecating an error or event should result in a `MAJOR` increment.
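
The three rules above can be sketched as a small version-bump helper. The
`Change` enumeration, `RegistryVersion` struct, and function names are
hypothetical, and resetting the lower components to zero on a `MINOR` or
`MAJOR` increment follows the usual semantic versioning convention rather than
anything stated explicitly in this policy.

```cpp
#include <string>

// Hypothetical classification of a change to a Foo.events.yaml file.
enum class Change
{
    TextOnly,    // description or message adjusted
    Addition,    // new error/event, or new metadata on an existing one
    Deprecation, // an error or event was deprecated
};

struct RegistryVersion
{
    unsigned majorNum;
    unsigned minorNum;
    unsigned patchNum;
};

// Apply the versioning policy to compute the next version.
RegistryVersion bump(RegistryVersion v, Change change)
{
    switch (change)
    {
        case Change::Deprecation:
            return {v.majorNum + 1, 0, 0};
        case Change::Addition:
            return {v.majorNum, v.minorNum + 1, 0};
        case Change::TextOnly:
            return {v.majorNum, v.minorNum, v.patchNum + 1};
    }
    return v; // unreachable for valid enumerators
}

std::string toString(const RegistryVersion& v)
{
    return std::to_string(v.majorNum) + "." + std::to_string(v.minorNum) +
           "." + std::to_string(v.patchNum);
}
```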

There is [guidance on maintenance][registry-guidance] of the OpenBMC Message
Registry. We will incorporate that guidance into the equivalent
`phosphor-dbus-interfaces` policy.

[registry-guidance]:
  https://github.com/openbmc/bmcweb/blob/master/redfish-core/include/registries/openbmc_message_registry.readmefirst.md

### Generated Redfish Message Registry

[DSP0266][dsp0266], the Redfish specification, gives requirements for Redfish
Message Registries and dictates guidelines for identifiers.

The hypothetical events defined above would create a message registry similar
to:

```json
{
  "Id": "OpenBMC_Base_Xyz_OpenbmcProject_Software_Update.1.3.1",
  "Language": "en",
  "Messages": {
    "UpdateFailure": {
      "Description": "While updating the firmware on a device, the update failed.",
      "Message": "A failure occurred updating %1 on %2.",
      "Resolution": "Retry update.",
      "NumberOfArgs": 2,
      "ParamTypes": ["string", "string"],
      "Severity": "Critical"
    },
    "UpdateProgress": {
      "Description": "An update is in progress and has reached a checkpoint.",
      "Message": "Updating of %1 is %2% complete.",
      "Resolution": "None",
      "NumberOfArgs": 2,
      "ParamTypes": ["string", "number"],
      "Severity": "OK"
    }
  }
}
```
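
The step from the YAML `message` (with `{NAME}` placeholders) to the registry
`Message` (with `%N` placeholders) can be sketched as follows. The function
name `toRedfishMessage` is hypothetical, and the numbering rule is an
assumption inferred from the example: only `primary` metadata fields are
numbered, in the order they are declared, which is why `{CALLOUT_HARDWARE}`
becomes `%2` even though `ERRNO` is declared before it.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: rewrite {NAME} placeholders as %N, where N is the
// 1-based position of NAME in the ordered list of primary metadata fields.
// Placeholders naming an unknown field are left unchanged.
std::string toRedfishMessage(const std::string& message,
                             const std::vector<std::string>& primaryNames)
{
    std::string out;
    std::size_t pos = 0;
    while (pos < message.size())
    {
        const std::size_t open = message.find('{', pos);
        const std::size_t close = (open == std::string::npos)
                                      ? std::string::npos
                                      : message.find('}', open);
        if (close == std::string::npos)
        {
            // No complete placeholder remains; copy the rest verbatim.
            out.append(message, pos, std::string::npos);
            break;
        }
        out.append(message, pos, open - pos);
        const std::string name = message.substr(open + 1, close - open - 1);
        const auto it =
            std::find(primaryNames.begin(), primaryNames.end(), name);
        if (it != primaryNames.end())
        {
            out += "%" + std::to_string((it - primaryNames.begin()) + 1);
        }
        else
        {
            out.append(message, open, close - open + 1);
        }
        pos = close + 1;
    }
    return out;
}
```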

The prefix `OpenBMC_Base` shall be exclusively reserved for use by events from
`phosphor-logging`. Events defined in other repositories will be expected to use
some other prefix. Vendor-defined repositories should use a vendor-owned prefix
as directed by [DSP0266][dsp0266].
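
As an illustration of how the registry `Id` in the example could be derived,
the sketch below turns the interface path `xyz/openbmc_project/Software/Update`
and a version into `OpenBMC_Base_Xyz_OpenbmcProject_Software_Update.1.3.1`. The
function name `registryId` is hypothetical and the transformation rule
(UpperCamelCase each path segment, join segments with `_`) is inferred from
that single example, so the real generator may differ.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: build a registry Id from a prefix, an interface path
// such as "xyz/openbmc_project/Software/Update", and a version string.
std::string registryId(const std::string& prefix,
                       const std::string& interfacePath,
                       const std::string& version)
{
    std::string id = prefix + "_";
    bool capitalizeNext = true;
    for (const char c : interfacePath)
    {
        if (c == '/')
        {
            // Path separators become '_' and start a new segment.
            id += '_';
            capitalizeNext = true;
        }
        else if (c == '_')
        {
            // Underscores inside a segment are dropped (snake to camel case).
            capitalizeNext = true;
        }
        else if (capitalizeNext)
        {
            id += static_cast<char>(
                std::toupper(static_cast<unsigned char>(c)));
            capitalizeNext = false;
        }
        else
        {
            id += c;
        }
    }
    return id + "." + version;
}
```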

[dsp0266]:
  https://www.dmtf.org/sites/default/files/standards/documents/DSP0266_1.20.0.pdf

### Vendor implications

As specified above, vendors must use their own identifiers in order to conform
with the Redfish specification (see [DSP0266][dsp0266] for requirements on
identifier naming). The `sdbusplus` (and `phosphor-logging` and `bmcweb`)
implementation(s) will enable vendors to create their own events for downstream
code and Registries for integration with Redfish, by creating downstream
repositories of error definitions. Vendors are responsible for ensuring their
own versioning and identifiers conform to the expectations in the [Redfish
specification][dsp0266].

One potential bad behavior on the part of vendors would be forking and modifying
`phosphor-dbus-interfaces` defined events. Vendors must not add their own events
to `phosphor-dbus-interfaces` in downstream implementations, because their
implementation would then advertise a message as part of an OpenBMC-owned
Registry when it is not. Instead, they should add them to their own repositories
with a separate identifier. Similarly, if a vendor were to _backport_ upstream
changes into their fork, they would need to ensure that the `Foo.events.yaml`
file for that version matches identically with the upstream implementation.
740
## Alternatives Considered

Many alternatives have been explored and referenced through earlier work. Within
this proposal there are many minor alternatives that have been assessed.

### Exception inheritance

The original `phosphor-logging` error descriptions allowed inheritance between
two errors. This is not supported by the proposal for two reasons:

- This introduces complexity in the Redfish Message Registry versioning because
  a change in one file should induce version changes in all dependent files.

- It makes it difficult for a developer to clearly identify all of the fields
  they are expected to populate without traversing multiple files.

### sdbusplus Exception APIs

There are a few possible syntaxes I came up with for constructing the generated
exception types. It is important that these have good ergonomics, are easy to
understand, and can provide compile-time awareness of missing metadata fields.

```cpp
    using Example = sdbusplus::error::xyz::openbmc_project::Example;

    // 1)
    throw Example().fru("Motherboard").value(42);

    // 2)
    throw Example(Example::fru_{}, "Motherboard", Example::value_{}, 42);

    // 3)
    throw Example("FRU", "Motherboard", "VALUE", 42);

    // 4)
    throw Example([](auto e) { return e.fru("Motherboard").value(42); });

    // 5)
    throw Example({.fru = "Motherboard", .value = 42});
```

**Note**: These examples are all shown using `throw` syntax, but could also be
saved in local variables, returned from functions, or immediately passed to
`lg2::commit`.

1. This would be my preference for ergonomics and clarity, as it would allow
   LSP-enabled editors to give completions for the metadata fields.
   Unfortunately, there is no mechanism in C++ to define a type which can be
   constructed but not thrown, which means we cannot get compile-time checking
   of all metadata fields.

2. This syntax uses tag-dispatch to enable compile-time checking of all
   metadata fields and potential LSP-completion of the tag-types, but is more
   verbose than option (3).

3. This syntax is less verbose than (2) and follows conventions already used in
   `phosphor-logging`'s `lg2` API, but does not allow LSP-completion of the
   metadata tags.

4. This syntax is similar to option (1) but uses an indirection of a lambda to
   enable compile-time checking that all metadata fields have been populated by
   the lambda. The LSP-completion is likely not as strong as option (1), due to
   the use of `auto`, and the need for a lambda will likely be a hang-up for
   unfamiliar developers.

5. This syntax has similar characteristics to option (1) and likewise does not
   provide compile-time confirmation that all fields have been populated.

The proposal therefore suggests option (3) is most suitable.

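For comparison with the chosen option, the tag-dispatch style of option (2)
could be sketched roughly as follows. The `Example` class here is a
hand-written, hypothetical stand-in for a generated type; the real `sdbusplus`
generated classes, their base class, and their member names are not specified
in this section and will likely differ.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a generated error type; not the actual
// sdbusplus-generated code.
class Example : public std::runtime_error
{
  public:
    // Empty tag types naming each metadata field.
    struct fru_
    {};
    struct value_
    {};

    // The only constructor requires every tag/value pair, so omitting a
    // field (or swapping the tags) fails to compile.
    Example(fru_ /*tag*/, std::string fruName, value_ /*tag*/, int val) :
        std::runtime_error("xyz.openbmc_project.Example"),
        fru(std::move(fruName)), value(val)
    {}

    std::string fru; // metadata field "FRU"
    int value;       // metadata field "VALUE"
};

// Usage mirroring option (2) above.
[[noreturn]] inline void reportFailure()
{
    throw Example(Example::fru_{}, "Motherboard", Example::value_{}, 42);
}
```

Because no other constructor exists, a call such as
`Example(Example::fru_{}, "Motherboard")` is rejected by the compiler, which is
the compile-time check the option descriptions refer to.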
### Redfish Translation Support

The proposed YAML format allows future addition of translation but it is not
enabled at this time. Future development could enable the Redfish Message
Registry to be generated in multiple languages if the `message:language` exists
for those languages.

### Redfish Registry Versioning

The Redfish Message Registries are required to be versioned and have three
numeric fields (i.e. `XX.YY.ZZ`), but only the first two are supposed to be used
in the Message ID. Rather than using the manually specified version we could
take a few other approaches:

- Use a date code (ex. `2024.17.x`) representing the ISO 8601 week when the
  registry was built.

  - This does not cover vendors that may choose to branch for stabilization
    purposes, so we can end up with two machines having the same
    OpenBMC-versioned message registry with different content.

- Use the most recent `openbmc/openbmc` tag as the version.

  - This does not cover vendors that build off HEAD and may deploy multiple
    images between two OpenBMC releases.

- Generate the version based on the git-history.

  - This requires `phosphor-dbus-interfaces` to be built from a git repository,
    which may not always be true for Yocto source mirrors, and requires
    non-trivial processing that continues to scale over time.

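Whichever source the version comes from, the relationship between a registry
`Id` and a `MessageId` described above can be sketched as below.
`makeMessageId` is a hypothetical helper written for illustration; it assumes
the `Id` ends in exactly three dot-separated version fields and does not check
that those fields are numeric.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: build a Redfish MessageId from a registry Id of the
// form "<Prefix>.<MAJOR>.<MINOR>.<PATCH>" and a message key. Only MAJOR and
// MINOR are carried into the MessageId. Returns an empty string if the Id
// does not have a prefix followed by three dot-separated fields.
std::string makeMessageId(const std::string& registryId,
                          const std::string& messageKey)
{
    const std::size_t npos = std::string::npos;

    // Walk backwards over the three trailing version fields.
    const std::size_t patchDot = registryId.rfind('.');
    if (patchDot == npos || patchDot == 0)
    {
        return {};
    }
    const std::size_t minorDot = registryId.rfind('.', patchDot - 1);
    if (minorDot == npos || minorDot == 0)
    {
        return {};
    }
    const std::size_t majorDot = registryId.rfind('.', minorDot - 1);
    if (majorDot == npos || majorDot == 0)
    {
        return {};
    }

    // Drop ".<PATCH>" and append the message key.
    return registryId.substr(0, patchDot) + "." + messageKey;
}
```

For the sample registry earlier in this document, the `UpdateFailure` key would
yield `OpenBMC_Base_Xyz_OpenbmcProject_Software_Update.1.3.UpdateFailure`, so a
patch-level change to the registry does not alter the identifiers clients see.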
### Existing OpenBMC Redfish Registry

There are currently 191 messages defined in the existing Redfish Message
Registry at version `OpenBMC.0.4.0`. Of those, not a single one in the codebase
is emitted with the correct version. 96 of those are only emitted by
Intel-specific code that is not pulled into any upstreamed machine, 39 are
emitted by potentially common code, and 56 are not even referenced in the
codebase outside of the bmcweb registry. Of the 39 common messages, half have an
equivalent in one of the standard registries that should be leveraged, and many
of the others lack attributes that would facilitate a multi-host configuration,
so the registry at a minimum needs to be updated. None of the current
implementations can handle Redfish Resource URIs.

The proposal therefore is to deprecate the existing registry and replace it with
the new generated registries. For repositories that currently emit events in the
existing format, we can maintain those call-sites for a time period of 1-2
years.

If this aspect of the proposal is rejected, the YAML format allows mapping from
`phosphor-dbus-interfaces` defined events to the current `OpenBMC.0.4.0`
registry `MessageIds`.

Potentially common:

- phosphor-post-code-manager
  - BIOSPOSTCode (unique)
- dbus-sensors
  - ChassisIntrusionDetected (unique)
  - ChassisIntrusionReset (unique)
  - FanInserted
  - FanRedundancyLost (unique)
  - FanRedundancyRegained (unique)
  - FanRemoved
  - LanLost
  - LanRegained
  - PowerSupplyConfigurationError (unique)
  - PowerSupplyConfigurationErrorRecovered (unique)
  - PowerSupplyFailed
  - PowerSupplyFailurePredicted (unique)
  - PowerSupplyFanFailed
  - PowerSupplyFanRecovered
  - PowerSupplyPowerLost
  - PowerSupplyPowerRestored
  - PowerSupplyPredictedFailureRecovered (unique)
  - PowerSupplyRecovered
- phosphor-sel-logger
  - IPMIWatchdog (unique)
  - `SensorThreshold*` : 8 different events
- phosphor-net-ipmid
  - InvalidLoginAttempted (unique)
- entity-manager
  - InventoryAdded (unique)
  - InventoryRemoved (unique)
- estoraged
  - ServiceStarted
- x86-power-control
  - NMIButtonPressed (unique)
  - NMIDiagnosticInterrupt (unique)
  - PowerButtonPressed (unique)
  - PowerRestorePolicyApplied (unique)
  - PowerSupplyPowerGoodFailed (unique)
  - ResetButtonPressed (unique)
  - SystemPowerGoodFailed (unique)

Intel-only implementations:

- intel-ipmi-oem
  - ADDDCCorrectable
  - BIOSPostERROR
  - BIOSRecoveryComplete
  - BIOSRecoveryStart
  - FirmwareUpdateCompleted
  - IntelUPILinkWidthReducedToHalf
  - IntelUPILinkWidthReducedToQuarter
  - LegacyPCIPERR
  - LegacyPCISERR
  - `ME*` : 29 different events
  - `Memory*` : 9 different events
  - MirroringRedundancyDegraded
  - MirroringRedundancyFull
  - `PCIeCorrectable*`, `PCIeFatal` : 29 different events
  - SELEntryAdded
  - SparingRedundancyDegraded
- pfr-manager
  - BIOSFirmwareRecoveryReason
  - BIOSFirmwarePanicReason
  - BMCFirmwarePanicReason
  - BMCFirmwareRecoveryReason
  - BMCFirmwareResiliencyError
  - CPLDFirmwarePanicReason
  - CPLDFirmwareResiliencyError
  - FirmwareResiliencyError
- host-error-monitor
  - CPUError
  - CPUMismatch
  - CPUThermalTrip
  - ComponentOverTemperature
  - SsbThermalTrip
  - VoltageRegulatorOverheated
- s2600wf-misc
  - DriveError
  - InventoryAdded

## Impacts

- New APIs are defined for error and event logging. This will deprecate existing
  `phosphor-logging` APIs, with a time to migrate, for error reporting.

- The design should improve performance by eliminating the regular parsing of
  the `systemd` journal. The design may decrease performance by allowing the
  number of error and event logs to be dramatically increased, which has an
  impact on file system utilization and a potential D-Bus impact on some
  services such as `ObjectMapper`.

- Backwards compatibility and documentation should be improved by the automatic
  generation of the Redfish Message Registry corresponding to all error and
  event reports.

### Organizational

- **Does this proposal require a new repository?**
  - No
- **Who will be the initial maintainer(s) of this repository?**
  - N/A
- **Which repositories are expected to be modified to execute this design?**
  - `sdbusplus`
  - `phosphor-dbus-interfaces`
  - `phosphor-logging`
  - `bmcweb`
  - Any repository creating an error or event.

## Testing

- Unit tests will be written in `sdbusplus` and `phosphor-logging` for the error
  and event generation, creation APIs, and to provide coverage on any changes to
  the `Logging.Entry` object management.

- Unit tests will be written for `bmcweb` for basic `Logging.Entry`
  transformation and Message Registry generation.

- Integration tests should be leveraged (and enhanced as necessary) from
  `openbmc-test-automation` to cover the end-to-end error creation and Redfish
  reporting.
