# Error and Event Logging

Author: [Patrick Williams][patrick-email]

[patrick-email]: mailto:patrick@stwcx.xyz

Other contributors:

Created: May 16, 2024

## Problem Description

There is currently no consistent end-to-end error and event reporting design for the OpenBMC code stack. There are two different implementations, one primarily using phosphor-logging and one using rsyslog, both of which have gaps that a complete solution should address. This proposal is intended to be an end-to-end design handling both errors and tracing events which facilitate external management of the system in an automated and maintainable manner.

## Background and References

### Redfish LogEntry and Message Registry

In Redfish, the [`LogEntry` schema][LogEntry] is used for a range of items that could be considered "logs", but one such use within OpenBMC is for an equivalent of the IPMI "System Event Log (SEL)".

The IPMI SEL is the location where the BMC can collect errors and events, sometimes coming from other entities, such as the BIOS. Examples of these might be "DIMM-A0 encountered an uncorrectable ECC error" or "System boot successful". These SEL records are exposed as human readable strings, either natively by an OEM SEL design or by tools such as `ipmitool`. The strings are typically unique to each system or manufacturer and could hypothetically change with a BMC or firmware update, and are thus difficult to create automated tooling around. Two different vendors might use different strings to represent a critical temperature threshold exceeded: ["temperature threshold exceeded"][HPE-Example] and ["Temperature #0x30 Upper Critical going high"][Oracle-Example]. There is also no mechanism with IPMI to ask the machine "what are all of the SELs you might create".

In order to solve two aspects of this problem, listing of possible events and versioning, Redfish has Message Registries.
A message registry is a versioned collection of all of the error events that a system could generate and hints as to how they might be parsed and displayed to a user. An [informative reference][Registry-Example] from the DMTF gives this example:

```json
{
  "@odata.type": "#MessageRegistry.v1_0_0.MessageRegistry",
  "Id": "Alert.1.0.0",
  "RegistryPrefix": "Alert",
  "RegistryVersion": "1.0.0",
  "Messages": {
    "LanDisconnect": {
      "Description": "A LAN Disconnect on %1 was detected on system %2.",
      "Message": "A LAN Disconnect on %1 was detected on system %2.",
      "Severity": "Warning",
      "NumberOfArgs": 2,
      "Resolution": "None"
    }
  }
}
```

This example defines an event, `Alert.1.0.LanDisconnect`, which can record the disconnect state of a network device and contains placeholders for the affected device and system. When this event occurs, there might be a `LogEntry` recorded containing something like:

```json
{
  "Message": "A LAN Disconnect on EthernetInterface 1 was detected on system /redfish/v1/Systems/1.",
  "MessageId": "Alert.1.0.LanDisconnect",
  "MessageArgs": ["EthernetInterface 1", "/redfish/v1/Systems/1"]
}
```

The `Message` contains a human readable string which was created by applying the `MessageArgs` to the placeholders from the `Message` field in the registry. System management software can rely on the message registry (referenced from the `MessageId` field in the `LogEntry`) and `MessageArgs` to avoid needing to perform string processing for reacting to the event.

Within OpenBMC, there is currently a [limited design][existing-design] for this Redfish feature and it requires inserting specially formed Redfish-specific logging messages into any application that wants to record these events, tightly coupling all applications to the Redfish implementation. It has also been observed that these [strings][app-example], when used, are often out of date with the [message registry][registry-example] advertised by `bmcweb`.
Some maintainers have rejected adding new Redfish-specific logging messages to their applications.

[LogEntry]: https://github.com/openbmc/bmcweb/blob/de0c960c4262169ea92a4b852dd5ebbe3810bf00/redfish-core/schema/dmtf/json-schema/LogEntry.v1_16_0.json
[HPE-Example]: https://support.hpe.com/hpesc/public/docDisplay?docId=sd00002092en_us&docLocale=en_US&page=GUID-D7147C7F-2016-0901-06CE-000000000422.html
[Oracle-Example]: https://docs.oracle.com/cd/E19464-01/820-6850-11/IPMItool.html#50602039_63068
[Registry-Example]: https://www.dmtf.org/sites/default/files/Redfish%20School%20-%20Events_0.pdf
[existing-design]: https://github.com/openbmc/docs/blob/master/architecture/redfish-logging-in-bmcweb.md
[app-example]: https://github.com/openbmc/phosphor-post-code-manager/blob/f2da78deb3a105c7270f74d9d747c77f0feaae2c/src/post_code.cpp#L143
[registry-example]: https://github.com/openbmc/bmcweb/blob/4ba5be51e3fcbeed49a6a312b4e6b2f1ea7447ba/redfish-core/include/registries/openbmc.json#L5

### Existing phosphor-logging implementation

**Note**: While the word 'exception' is used in this section, the existing (and proposed) types can be used by applications and execution contexts with exceptions disabled. They are 'exceptions' because they do inherit from `std::exception` and there is support in the `sdbusplus` bindings for them to be used in exception handling.

The `sdbusplus` bindings have the capability to define new C++ exception types which can be thrown by a DBus server and turned into an error response to the client. `phosphor-logging` extended this to also add metadata associated to the log type. See the following example error definitions and usages.

`sdbusplus` error binding definition (in `xyz/openbmc_project/Certs.errors.yaml`):

```yaml
- name: InvalidCertificate
  description: Invalid certificate file.
```

`phosphor-logging` metadata definition (in `xyz/openbmc_project/Certs.metadata.yaml`):

```yaml
- name: InvalidCertificate
  meta:
    - str: "REASON=%s"
      type: string
```

Application code reporting an error:

```cpp
// Throw the error as an exception:
elog<InvalidCertificate>(Reason("Invalid certificate file format"));
// or record it directly with the logging daemon:
report<InvalidCertificate>(Reason("Existing certificate file is corrupted"));
```

In this sample, an error named `xyz.openbmc_project.Certs.Error.InvalidCertificate` has been defined, which can be sent between applications as a DBus response. The `InvalidCertificate` is expected to have additional metadata `REASON` which is a string. The two APIs `elog` and `report` have slightly different behaviors: `elog` throws an exception which can either result in an error DBus result or be handled elsewhere in the application, while `report` sends the event directly to `phosphor-logging`'s daemon for recording. As a side-effect of both calls, the metadata is inserted into the `systemd` journal.

When an error is sent to the `phosphor-logging` daemon, it will:

1. Search back through the journal for recorded metadata associated with the event (this is a relatively slow operation).
2. Create an [`xyz.openbmc_project.Logging.Entry`][Logging-Entry] DBus object with the associated data extracted from the journal.
3. Persist a serialized version of the object.

Within `bmcweb` there is support for translating `xyz.openbmc_project.Logging.Entry` objects advertised by `phosphor-logging` into Redfish `LogEntries`, but this support does not reference a Message Registry. This makes the events of limited utility for consumption by system management software, as it cannot know all of the event types and is left to perform (hand-coded) regular-expressions to extract any information from the `Message` field of the `LogEntry`. Furthermore, these regular-expressions are likely to become outdated over time as internal OpenBMC error reporting structure, metadata, or message strings evolve.
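For contrast with the regular-expression approach, the following is a minimal sketch of how a consumer holding a Message Registry could rebuild the human readable `Message` from the registry template and `MessageArgs`, using the `%1`, `%2` placeholder style from the DMTF example earlier in this document. The function name `formatMessage` is hypothetical and is not part of `bmcweb` or any Redfish library; leaving unmatched placeholders untouched is an assumption made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: replace each %N placeholder (1-based) in a registry
// Message template with the matching MessageArgs entry. Placeholders with no
// matching argument are copied through unchanged (an assumption).
inline std::string formatMessage(const std::string& tmpl,
                                 const std::vector<std::string>& args)
{
    auto isDigit = [](char c) {
        return std::isdigit(static_cast<unsigned char>(c)) != 0;
    };

    std::string out;
    for (std::size_t i = 0; i < tmpl.size();)
    {
        if (tmpl[i] == '%' && i + 1 < tmpl.size() && isDigit(tmpl[i + 1]))
        {
            // Parse the decimal index that follows the '%'.
            std::size_t end = i + 1;
            std::size_t index = 0;
            while (end < tmpl.size() && isDigit(tmpl[end]))
            {
                index = index * 10 + static_cast<std::size_t>(tmpl[end] - '0');
                ++end;
            }
            if (index >= 1 && index <= args.size())
            {
                out += args[index - 1];
            }
            else
            {
                out.append(tmpl, i, end - i);
            }
            i = end;
        }
        else
        {
            out += tmpl[i++];
        }
    }
    return out;
}
```

Because the arguments arrive as structured data, management software can act on `MessageArgs` directly and only needs a helper like this for display purposes.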
[Logging-Entry]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/9012243e543abdc5851b7e878c17c991b2a2a8b7/yaml/xyz/openbmc_project/Logging/Entry.interface.yaml#L1

### Issues with the Status Quo

- There are two different implementations of error logging, neither of which is both complete and fully accepted by maintainers. These implementations also do not cover tracing events.
- The `REDFISH_MESSAGE_ID` log approach leads to differences between the Redfish Message Registry and the reporting application. It also requires every application to be "Redfish aware" which limits decoupling between applications and external management interfaces. This also leaves gaps for reporting errors in different management interfaces, such as inband IPMI and PLDM. The approach also does not provide compile-time assurance of appropriate metadata collection, which can lead to producing code being out-of-date with the message registry definitions.
- The `phosphor-logging` approach does not provide compile-time assurance of appropriate metadata collection and requires expensive daemon processing of the `systemd` journal on each error report, which limits scalability.
- The `sdbusplus` bindings for error reporting do not currently handle lossless transmission of errors between DBus servers and clients.
- Similar applications can result in different Redfish `LogEntry` for the same error scenario. This has been observed in sensor threshold exceeded events between `dbus-sensors`, `phosphor-hwmon`, `phosphor-virtual-sensor`, and `phosphor-health-monitor`. One cause of this is two different error reporting approaches and disagreements amongst maintainers as to the preferred approach.

## Requirements

- Applications running on the BMC must be able to report errors and failures which are persisted and available for external system management through standards such as Redfish.
  - These errors must be structured, versioned, and the complete set of errors able to be created by the BMC should be available at build time of a BMC image.
  - The set of errors, able to be created by the BMC, must be able to be transformed into relevant data sets, such as Redfish Message Registries.
    - For Redfish, the transformation must comply with the Redfish standard requirements, such as conforming to semantic versioning expectations.
    - For Redfish, the transformation should allow mapping internally defined events to pre-existing Redfish Message Registries for broader compatibility.
    - For Redfish, the implementation must also support the EventService mechanics for push-reporting.
  - Errors reported by the BMC should contain sufficient information to allow service of the system for these failures, either by humans or automation (depending on the individual system requirements).
- Applications running on the BMC should be able to report important tracing events relevant to system management and/or debug, such as the system successfully reaching a running state.
  - All requirements relevant to errors are also applicable to tracing events.
  - The implementation must have a mechanism for vendors to be able to disable specific tracing events to conform to their own system design requirements.
- Applications running on the BMC should be able to determine when a previously reported error is no longer relevant and mark it as "resolved", while maintaining the persistent record for future usages such as debug.
- The BMC should provide a mechanism for managed entities within the server to report their own errors and events. Examples of managed entities would be firmware, such as the BIOS, and satellite management controllers.
- The implementation on the BMC should scale to a minimum of [10,000][error-discussion] errors and events without impacting the BMC or managed system performance.
- The implementation should provide a mechanism to allow OEM or vendor extensions to the error and event definitions (and generated artifacts such as the Redfish Message Registry) for usage in closed-source or non-upstreamed code. These extensions must be clearly identified, in all interfaces, as vendor-specific and not be tied to the OpenBMC project.
- APIs to implement error and event reporting should have good ergonomics. These APIs must provide compile-time identification, for applicable programming languages, of call sites which do not conform to the BMC error and event specifications.
  - The generated error classes and APIs should not require exceptions but should also integrate with the `sdbusplus` client and server bindings, which do leverage exceptions.

[error-discussion]: https://discord.com/channels/775381525260664832/855566794994221117/867794201897992213

## Proposed Design

The proposed design has a few high-level design elements:

- Consolidate the `sdbusplus` and `phosphor-logging` implementation of error reporting; expand it to cover tracing events; improve the ergonomics of the associated APIs and add compile-time checking of missing metadata.
- Add APIs to `phosphor-logging` to enable daemons to easily look up their own previously reported events (for marking as resolved).
- Add to `phosphor-logging` a compile-time mechanism to disable recording of specific tracing events for vendor-level customization.
- Generate a Redfish Message Registry for all errors and events defined in `phosphor-dbus-interfaces`, using binding generators from `sdbusplus`. Enhance the `bmcweb` implementation of the `Logging.Entry` to `LogEntry` transformation to cover the Redfish Message Registry and `phosphor-logging` enhancements; leverage the Redfish `LogEntry.DiagnosticData` field to provide a Base64-encoded JSON representation of the entire `Logging.Entry` for additional diagnostics [[does this need to be optional?]].
  Add support to the `bmcweb` EventService implementation to support `phosphor-logging`-hosted events.

### `sdbusplus`

The `Foo.errors.yaml` content will be combined with the content formerly in the `Foo.metadata.yaml` files specified by `phosphor-logging` and expressed in a new file type, `Foo.events.yaml`. This `Foo.events.yaml` format will cover both the current `error` and `metadata` information as well as augment it with additional information necessary to generate external facing datasets, such as Redfish Message Registries. The current `Foo.errors.yaml` and `Foo.metadata.yaml` files will be deprecated as their usage is replaced by the new format.

The `sdbusplus` library will be enhanced to provide the following:

- JSON serialization and de-serialization of generated exception types with their assigned metadata; assignment of the JSON serialization to the `message` field of `sd_bus_error_set` calls when errors are returned from DBus server calls.
- A facility to register exception types, at library load time, with the `sdbusplus` library for automatic conversion back to C++ exception types in DBus clients.

The binding generator(s) will be expanded to do the following:

- Generate complete C++ exception types, with compile-time checking of missing metadata and JSON serialization, for errors and events. Metadata can be of one of the following types:
  - size-type and signed integer
  - floating-point number
  - string
  - DBus object path
- Generate a format that `bmcweb` can use to create and populate a Redfish Message Registry, and translate from `phosphor-logging` to Redfish `LogEntry` for a set of errors and events.

For general users of `sdbusplus` these changes should have no impact, except for the availability of new generated exception types and that specialized instances of `sdbusplus::exception::generated_exception` will become available in DBus clients.
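To make the JSON serialization idea more concrete, here is a minimal sketch of how an event name and its typed metadata might be flattened into a JSON string. Everything in it is an assumption for illustration: the names `ObjectPath`, `MetadataValue`, `quote`, and `toJson` are hypothetical, the `{"<name>":{...}}` layout is not a format defined by this proposal, and a real implementation would likely use a JSON library rather than hand-written escaping.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical stand-in for a DBus object path (not the sdbusplus type).
struct ObjectPath
{
    std::string str;
};

// One metadata value, mirroring the metadata types listed above.
using MetadataValue =
    std::variant<std::string, int64_t, uint64_t, double, ObjectPath>;

// Quote a string, escaping only '"' and '\\'. A complete serializer would
// also escape control characters.
inline std::string quote(const std::string& in)
{
    std::string out = "\"";
    for (char c : in)
    {
        if (c == '"' || c == '\\')
        {
            out += '\\';
        }
        out += c;
    }
    return out + "\"";
}

// Serialize a single metadata value according to its held type.
inline std::string toJson(const MetadataValue& v)
{
    return std::visit(
        [](const auto& x) -> std::string {
            using T = std::decay_t<decltype(x)>;
            if constexpr (std::is_same_v<T, std::string>)
            {
                return quote(x);
            }
            else if constexpr (std::is_same_v<T, ObjectPath>)
            {
                return quote(x.str);
            }
            else
            {
                return std::to_string(x);
            }
        },
        v);
}

// Serialize an event as {"<name>":{"KEY":value,...}} (assumed layout).
inline std::string toJson(const std::string& name,
                          const std::map<std::string, MetadataValue>& meta)
{
    assert(!name.empty());
    std::string out = "{" + quote(name) + ":{";
    bool first = true;
    for (const auto& [key, value] : meta)
    {
        out += (first ? "" : ",") + quote(key) + ":" + toJson(value);
        first = false;
    }
    return out + "}}";
}
```

Keeping the metadata typed until the final serialization step is what would let a client rebuild an equivalent exception object from the JSON, rather than parsing free-form text.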
### `phosphor-dbus-interfaces`

Refactoring will be done to migrate existing `Foo.metadata.yaml` and `Foo.errors.yaml` content to the `Foo.events.yaml` as migration is done by applications. Minor changes will take place to utilize the new binding generators from `sdbusplus`. A small library enhancement will be done to register all generated exception types with `sdbusplus`. Future contributors will be able to contribute new error and tracing event definitions.

### `phosphor-logging`

> TODO: Should a tracing event be a `Logging.Entry` with severity of
> `Informational` or should they be a new type, such as `Logging.Event` and
> managed separately? The `phosphor-logging` default `meson.options` have
> `error_cap=200` and `error_info_cap=10`. If we increase the total number of
> events allowed to 10K, the majority of them are likely going to be
> informational / tracing events.

The `Logging.Entry` interface's `AdditionalData` property should change to `dict[string, variant[string,int64_t,size_t,object_path]]`.

The `Logging.Create` interface will have a new method added:

```yaml
- name: CreateEntry
  parameters:
    - name: Message
      type: string
    - name: Severity
      type: enum[Logging.Entry.Level]
    - name: AdditionalData
      type: dict[string, variant[string,int64_t,size_t,object_path]]
    - name: Hint
      type: string
      default: ""
  returns:
    - name: Entry
      type: object_path
```

The `Hint` parameter is used for daemons to be able to query for their previously recorded error, for marking as resolved. These strings need to be globally unique and are suggested to be of the format `":"`.
A `Logging.SearchHint` interface will be created, which will be recorded at the same object path as a `Logging.Entry` when the `Hint` parameter was not an empty string:

```yaml
- property: Hint
  type: string
```

The `Logging.Manager` interface will be added with a single method:

```yaml
- name: FindEntry
  parameters:
    - name: Hint
      type: string
  returns:
    - name: Entry
      type: object_path
  errors:
    - xyz.openbmc_project.Common.ResourceNotFound
```

A `lg2::commit` API will be added to support the new `sdbusplus` generated exception types, calling the new `Logging.Create.CreateEntry` method proposed earlier. This new API will support `sdbusplus::bus_t` for synchronous DBus operations and both `sdbusplus::async::context_t` and `sdbusplus::asio::connection` for asynchronous DBus operations.

There are outstanding performance concerns with the `phosphor-logging` implementation that may impact the ability for scaling to 10,000 event records. This issue is expected to be self-contained within `phosphor-logging`, except for potential future changes to the log-retrieval interfaces used by `bmcweb`. In order to decouple the transition to this design, by callers of the logging APIs, from the experimentation and improvements in `phosphor-logging`, we will add a compile option and Yocto `DISTRO_FEATURE` that can turn `lg2::commit` behavior into an `OPENBMC_MESSAGE_ID` record in the journal, along the same approach as the previous `REDFISH_MESSAGE_ID`, and corresponding `rsyslog` configuration and `bmcweb` support to use these directly. This will give systems which knowingly scale to a large number of event records, using `rsyslog` mechanics, the same level of performance. One caveat of this support is that the hint and resolution behavior will not exist when that option is enabled.
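The hint flow described in this section (create an entry with a `Hint`, find it again later, mark it resolved while keeping the record) can be sketched with a small in-memory model. This is not the `phosphor-logging` implementation: the class `EntryStore`, its method names, the entry path prefix, and the use of `std::nullopt` in place of the `ResourceNotFound` DBus error are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical in-memory model of the proposed hint bookkeeping. The real
// daemon would expose this over DBus and persist the entries.
class EntryStore
{
  public:
    // Models Logging.Create.CreateEntry: returns the new entry's path and
    // indexes it by hint when the hint is not empty.
    std::string createEntry(const std::string& message,
                            const std::string& hint = "")
    {
        assert(!message.empty());
        // Illustrative path prefix only.
        std::string path = "/xyz/openbmc_project/logging/entry/" +
                           std::to_string(++lastId);
        entries[path] = Entry{message, false};
        if (!hint.empty())
        {
            hints[hint] = path;
        }
        return path;
    }

    // Models Logging.Manager.FindEntry: std::nullopt stands in for the
    // xyz.openbmc_project.Common.ResourceNotFound error.
    std::optional<std::string> findEntry(const std::string& hint) const
    {
        auto it = hints.find(hint);
        if (it == hints.end())
        {
            return std::nullopt;
        }
        return it->second;
    }

    // Mark an entry resolved without deleting the persistent record.
    bool resolve(const std::string& path)
    {
        auto it = entries.find(path);
        if (it == entries.end())
        {
            return false;
        }
        it->second.resolved = true;
        return true;
    }

    bool isResolved(const std::string& path) const
    {
        auto it = entries.find(path);
        return it != entries.end() && it->second.resolved;
    }

  private:
    struct Entry
    {
        std::string message;
        bool resolved = false;
    };

    std::uint64_t lastId = 0;
    std::map<std::string, Entry> entries;
    std::map<std::string, std::string> hints;
};
```

A daemon would keep only the hint string in its own state, which is why the text asks for hints to be globally unique.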
### `bmcweb`

`bmcweb` already has support for build-time conversion from a Redfish Message Registry, codified in JSON, to header files it uses to serve the registry; this will be expanded to support Redfish Message Registries generated by `sdbusplus`. `bmcweb` will add a Meson option for additional message registries, provided by bitbake from `phosphor-dbus-interfaces` and vendor-specific event definitions as a path to a directory of Message Registry JSONs. Support will also be added for adding `phosphor-dbus-interfaces` as a Meson subproject for stand-alone testing.

It is desirable for `sdbusplus` to generate a Redfish Message Registry directly, leveraging the existing scripts for integration with `bmcweb`. As part of this we would like to support mapping a `Logging.Entry` event to an existing standardized Redfish event (such as those in the Base registry). The generated information must contain the `Logging.Entry::Message` identifier, the `AdditionalData` to `MessageArgs` mapping, and the translation from the `Message` identifier to the Redfish Message ID (when the Message ID is not from "this" registry). In order to facilitate this, we will need to add OEM fields to the Redfish Message Registry JSON, which are only used by the `bmcweb` processing scripts, to generate the information necessary for this additional mapping.

The `xyz.openbmc_project.Logging.Entry` to `LogEntry` conversion needs to be enhanced, to utilize these Message Registries, in four ways:

1. A Base64-encoded JSON representation of the `Logging.Entry` will be assigned to the `DiagnosticData` property.
2. If the `Logging.Entry::Message` contains an identifier corresponding to a Registry entry, the `MessageId` property will be set to the corresponding Redfish Message ID. Otherwise, the `Logging.Entry::Message` will be used directly with no further transformation (as is done today).
3. If the `Logging.Entry::Message` contains an identifier corresponding to a Registry entry, the `MessageArgs` property will be filled in by obtaining the corresponding values from the `AdditionalData` dictionary and the `Message` field will be generated from combining these values with the `Message` string from the Registry.
4. A mechanism should be implemented to translate DBus `object_path` references to Redfish Resource URIs. When an `object_path` cannot be translated, `bmcweb` will use a prefix such as `object_path:` in the `MessageArgs` value.

The implementation of `EventService` should be enhanced to support `phosphor-logging` hosted events. The implementation of `LogService` should be enhanced to support log paging for `phosphor-logging` hosted events.

### `phosphor-sel-logger`

The `phosphor-sel-logger` has a meson option `send-to-logger` which toggles between using `phosphor-logging` or the [`REDFISH_MESSAGE_ID` mechanism][existing-design]. The `phosphor-logging`-utilizing paths will be updated to utilize `phosphor-dbus-interfaces` specified errors and events.

### YAML format

Consider an example file in `phosphor-dbus-interfaces` as `yaml/xyz/openbmc_project/Software/Update.events.yaml` with hypothetical errors and events:

```yaml
version: 1.3.1

errors:
  - name: UpdateFailure
    severity: critical
    metadata:
      - name: TARGET
        type: string
        primary: true
      - name: ERRNO
        type: int64
      - name: CALLOUT_HARDWARE
        type: object_path
        primary: true
    en:
      description: While updating the firmware on a device, the update failed.
      message: A failure occurred updating {TARGET} on {CALLOUT_HARDWARE}.
      resolution: Retry update.

  - name: BMCUpdateFailure
    severity: critical
    deprecated: 1.0.0
    en:
      description: Failed to update the BMC
    redfish-mapping: OpenBMC.FirmwareUpdateFailed

events:
  - name: UpdateProgress
    metadata:
      - name: TARGET
        type: string
        primary: true
      - name: COMPLETION
        type: double
        primary: true
    en:
      description: An update is in progress and has reached a checkpoint.
      message: Updating of {TARGET} is {COMPLETION}% complete.
```

Each `foo.events.yaml` file would be used to generate both the C++ classes (via `sdbusplus`) for exception handling and event reporting, as well as a versioned Redfish Message Registry for the errors and events.

The YAML schema is as follows:

```yaml
$id: https://openbmc-project.xyz/sdbusplus/events.schema.yaml
$schema: https://json-schema.org/draft/2020-12/schema
title: Event and error definitions
type: object
$defs:
  event:
    type: array
    items:
      type: object
      properties:
        name:
          type: string
          description: An identifier for the event in UpperCamelCase; used as the class and Redfish Message ID.
        en:
          type: object
          description: The details for English.
          properties:
            description:
              type: string
              description: A developer-applicable description of the error reported. These form the "description" of the Redfish message.
            message:
              type: string
              description: The end-user message, including placeholders for arguments.
            resolution:
              type: string
              description: The end-user resolution.
        severity:
          enum:
            - emergency
            - alert
            - critical
            - error
            - warning
            - notice
            - informational
            - debug
          description: The `xyz.openbmc_project.Logging.Entry.Level` value for this error. Only applicable for 'errors'.
        redfish-mapping:
          type: string
          description: Used when a `sdbusplus` event should map to a specific Redfish Message rather than a generated one. This is useful when an internal error has an analog in a standardized registry.
        deprecated:
          type: string
          pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$"
          description: Indicates that the event is now deprecated and should not be created by any OpenBMC software, but is required to still exist for generation in the Redfish Message Registry. The version listed here should be the first version where the error is no longer used.
        metadata:
          type: array
          items:
            type: object
            properties:
              name:
                type: string
                description: The name of the metadata field.
              type:
                enum:
                  - string
                  - size
                  - int64
                  - uint64
                  - double
                  - object_path
                description: The type of the metadata field.
              primary:
                type: boolean
                description: Set to true when the metadata field is expected to be part of the Redfish `MessageArgs` (and not only in the extended `DiagnosticData`).
properties:
  version:
    type: string
    pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$"
    description: The version of the file, which will be used as the Redfish Message Registry version.
  errors:
    $ref: "#/$defs/event"
  events:
    $ref: "#/$defs/event"
```

The above example YAML would generate C++ classes similar to:

```cpp
namespace sdbusplus::errors::xyz::openbmc_project::software::update
{

class UpdateFailure
{
  public:
    // Accepts metadata as alternating name/value arguments.
    template <typename... Args>
    UpdateFailure(Args&&... args);
};

} // namespace sdbusplus::errors::xyz::openbmc_project::software::update

namespace sdbusplus::events::xyz::openbmc_project::software::update
{

class UpdateProgress
{
  public:
    // Accepts metadata as alternating name/value arguments.
    template <typename... Args>
    UpdateProgress(Args&&... args);
};

} // namespace sdbusplus::events::xyz::openbmc_project::software::update
```

The constructors here are variadic templates because the generated constructor implementation will provide compile-time assurance that all of the metadata fields have been populated (in any order). To raise an `UpdateFailure` a developer might do something like:

```cpp
// Immediately report the event:
lg2::commit(UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc,
                          "CALLOUT_HARDWARE", bmc_object_path));
// or send it in a dbus response (when using sdbusplus generated binding):
throw UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc,
                    "CALLOUT_HARDWARE", bmc_object_path);
```

If one of the fields, such as `ERRNO`, were omitted, a compile failure would be raised indicating the first missing field.

### Versioning Policy

Assume the version follows the semantic versioning `MAJOR.MINOR.PATCH` convention.

- Adjusting a description or message should result in a `PATCH` increment.
- Adding a new error or event, or adding metadata to an existing error or event, should result in a `MINOR` increment.
- Deprecating an error or event should result in a `MAJOR` increment.
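As a small worked illustration of the policy above, the sketch below maps each kind of change to the version component it increments. The type and function names (`Version`, `Change`, `bump`, `toString`) are hypothetical, and resetting the lower components to zero on a `MINOR` or `MAJOR` increment is an assumption taken from common semantic versioning practice rather than something this proposal states.

```cpp
#include <cassert>
#include <string>

// Hypothetical representation of a MAJOR.MINOR.PATCH version.
struct Version
{
    unsigned majorPart;
    unsigned minorPart;
    unsigned patchPart;
};

// The three kinds of change named in the policy list.
enum class Change
{
    TextEdit,    // description or message adjusted
    Addition,    // new error/event, or metadata added to an existing one
    Deprecation, // an error or event was deprecated
};

// Return the version that should follow `v` after change `c`.
inline Version bump(Version v, Change c)
{
    switch (c)
    {
        case Change::TextEdit:
            ++v.patchPart;
            break;
        case Change::Addition:
            ++v.minorPart;
            v.patchPart = 0; // assumed reset, per usual semver practice
            break;
        case Change::Deprecation:
            ++v.majorPart;
            v.minorPart = 0; // assumed reset, per usual semver practice
            v.patchPart = 0;
            break;
    }
    return v;
}

inline std::string toString(const Version& v)
{
    return std::to_string(v.majorPart) + "." + std::to_string(v.minorPart) +
           "." + std::to_string(v.patchPart);
}
```

Starting from the `1.3.1` used in the example file, a reworded message would give `1.3.2`, a new event `1.4.0`, and a deprecation `2.0.0` under these assumptions.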
There is [guidance on maintenance][registry-guidance] of the OpenBMC Message Registry. We will incorporate that guidance into the equivalent `phosphor-dbus-interfaces` policy.

[registry-guidance]: https://github.com/openbmc/bmcweb/blob/master/redfish-core/include/registries/openbmc_message_registry.readmefirst.md

### Generated Redfish Message Registry

[DSP0266][dsp0266], the Redfish specification, gives requirements for Redfish Message Registries and dictates guidelines for identifiers.

The hypothetical events defined above would create a message registry similar to:

```json
{
  "Id": "OpenBMC_Base_Xyz_OpenbmcProject_Software_Update.1.3.1",
  "Language": "en",
  "Messages": {
    "UpdateFailure": {
      "Description": "While updating the firmware on a device, the update failed.",
      "Message": "A failure occurred updating %1 on %2.",
      "Resolution": "Retry update.",
      "NumberOfArgs": 2,
      "ParamTypes": ["string", "string"],
      "Severity": "Critical"
    },
    "UpdateProgress": {
      "Description": "An update is in progress and has reached a checkpoint.",
      "Message": "Updating of %1 is %2% complete.",
      "Resolution": "None",
      "NumberOfArgs": 2,
      "ParamTypes": ["string", "number"],
      "Severity": "OK"
    }
  }
}
```

The prefix `OpenBMC_Base` shall be exclusively reserved for use by events from `phosphor-logging`. Events defined in other repositories will be expected to use some other prefix. Vendor-defined repositories should use a vendor-owned prefix as directed by [DSP0266][dsp0266].

[dsp0266]: https://www.dmtf.org/sites/default/files/standards/documents/DSP0266_1.20.0.pdf

### Vendor implications

As specified above, vendors must use their own identifiers in order to conform with the Redfish specification (see [DSP0266][dsp0266] for requirements on identifier naming). The `sdbusplus` (and `phosphor-logging` and `bmcweb`) implementation(s) will enable vendors to create their own events for downstream code and Registries for integration with Redfish, by creating downstream repositories of error definitions.
Vendors are responsible for ensuring their own versioning and identifiers conform to the expectations in the [Redfish specification][dsp0266].

One potential bad behavior on the part of vendors would be forking and modifying `phosphor-dbus-interfaces` defined events. Vendors must not add their own events to `phosphor-dbus-interfaces` in downstream implementations, because their implementation would then advertise a message as part of an OpenBMC-owned Registry when it is not. Instead they should add them to their own repositories with a separate identifier. Similarly, if a vendor were to _backport_ upstream changes into their fork, they would need to ensure that the `foo.events.yaml` file for that version matches identically with the upstream implementation.

## Alternatives Considered

Many alternatives have been explored and referenced through earlier work. Within this proposal there are many minor alternatives that have been assessed.

### Exception inheritance

The original `phosphor-logging` error descriptions allowed inheritance between two errors. This is not supported by the proposal for two reasons:

- This introduces complexity in the Redfish Message Registry versioning because a change in one file should induce version changes in all dependent files.
- It makes it difficult for a developer to clearly identify all of the fields they are expected to populate without traversing multiple files.

### sdbusplus Exception APIs

There are a few possible syntaxes I came up with for constructing the generated exception types. It is important that these have good ergonomics, are easy to understand, and can provide compile-time awareness of missing metadata fields.
```cpp
using Example = sdbusplus::error::xyz::openbmc_project::Example;

// 1)
throw Example().fru("Motherboard").value(42);

// 2)
throw Example(Example::fru_{}, "Motherboard", Example::value_{}, 42);

// 3)
throw Example("FRU", "Motherboard", "VALUE", 42);

// 4)
throw Example([](auto e) { return e.fru("Motherboard").value(42); });

// 5)
throw Example({.fru = "Motherboard", .value = 42});
```

**Note**: These examples are all shown using `throw` syntax, but could also be saved in local variables, returned from functions, or immediately passed to `lg2::commit`.

1. This would be my preference for ergonomics and clarity, as it would allow LSP-enabled editors to give completions for the metadata fields but unfortunately there is no mechanism in C++ to define a type which can be constructed but not thrown, which means we cannot get compile-time checking of all metadata fields.
2. This syntax uses tag-dispatch to enable compile-time checking of all metadata fields and potential LSP-completion of the tag-types, but is more verbose than option 3.
3. This syntax is less verbose than (2) and follows conventions already used in `phosphor-logging`'s `lg2` API, but does not allow LSP-completion of the metadata tags.
4. This syntax is similar to option (1) but uses an indirection of a lambda to enable compile-time checking that all metadata fields have been populated by the lambda. The LSP-completion is likely not as strong as option (1), due to the use of `auto`, and the lambda necessity will likely be a hang-up for unfamiliar developers.
5. This syntax has similar characteristics to option (1) and similarly does not provide compile-time confirmation that all fields have been populated.

The proposal therefore suggests option (3) is most suitable.

### Redfish Translation Support

The proposed YAML format allows future addition of translation but it is not enabled at this time.
Future development could enable the Redfish Message Registry to be generated in multiple languages if the `message:language` exists for those languages.

### Redfish Registry Versioning

Redfish Message Registries are required to be versioned with 3 numeric fields (i.e. `XX.YY.ZZ`), but only the first 2 are supposed to be used in the Message ID. Rather than using the manually specified version we could take a few other approaches:

- Use a date code (ex. `2024.17.x`) representing the ISO 8601 week when the registry was built.
  - This does not cover vendors that may choose to branch for stabilization purposes, so we can end up with two machines having the same OpenBMC-versioned message registry with different content.
- Use the most recent `openbmc/openbmc` tag as the version.
  - This does not cover vendors that build off HEAD and may deploy multiple images between two OpenBMC releases.
- Generate the version based on the git-history.
  - This requires `phosphor-dbus-interfaces` to be built from a git repository, which may not always be true for Yocto source mirrors, and requires non-trivial processing that continues to scale over time.

### Existing OpenBMC Redfish Registry

There are currently 191 messages defined in the existing Redfish Message Registry at version `OpenBMC.0.4.0`. Of those, not a single one in the codebase is emitted with the correct version. 96 of those are only emitted by Intel-specific code that is not pulled into any upstreamed machine, 39 are emitted by potentially common code, and 56 are not even referenced in the codebase outside of the bmcweb registry. Of the 39 common messages, half have an equivalent in one of the standard registries that should be leveraged and many of the others do not have attributes that would facilitate a multi-host configuration, so the registry at a minimum needs to be updated. None of the current implementation has the capability to handle Redfish Resource URIs.
The proposal therefore is to deprecate the existing registry and replace it with the new generated registries. For repositories that currently emit events in the existing format, we can maintain those call-sites for a time period of 1-2 years.

If this aspect of the proposal is rejected, the YAML format allows mapping from `phosphor-dbus-interfaces` defined events to the current `OpenBMC.0.4.0` registry `MessageIds`.

Potentially common:

- phosphor-post-code-manager
  - BIOSPOSTCode (unique)
- dbus-sensors
  - ChassisIntrusionDetected (unique)
  - ChassisIntrusionReset (unique)
  - FanInserted
  - FanRedundancyLost (unique)
  - FanRedudancyRegained (unique)
  - FanRemoved
  - LanLost
  - LanRegained
  - PowerSupplyConfigurationError (unique)
  - PowerSupplyConfigurationErrorRecovered (unique)
  - PowerSupplyFailed
  - PowerSupplyFailurePredicted (unique)
  - PowerSupplyFanFailed
  - PowerSupplyFanRecovered
  - PowerSupplyPowerLost
  - PowerSupplyPowerRestored
  - PowerSupplyPredictiedFailureRecovered (unique)
  - PowerSupplyRecovered
- phosphor-sel-logger
  - IPMIWatchdog (unique)
  - `SensorThreshold*` : 8 different events
- phosphor-net-ipmid
  - InvalidLoginAttempted (unique)
- entity-manager
  - InventoryAdded (unique)
  - InventoryRemoved (unique)
- estoraged
  - ServiceStarted
- x86-power-control
  - NMIButtonPressed (unique)
  - NMIDiagnosticInterrupt (unique)
  - PowerButtonPressed (unique)
  - PowerRestorePolicyApplied (unique)
  - PowerSupplyPowerGoodFailed (unique)
  - ResetButtonPressed (unique)
  - SystemPowerGoodFailed (unique)

Intel-only implementations:

- intel-ipmi-oem
  - ADDDCCorrectable
  - BIOSPostERROR
  - BIOSRecoveryComplete
  - BIOSRecoveryStart
  - FirmwareUpdateCompleted
  - IntelUPILinkWidthReducedToHalf
  - IntelUPILinkWidthReducedToQuarter
  - LegacyPCIPERR
  - LegacyPCISERR
  - `ME*` : 29 different events
  - `Memory*` : 9 different events
  - MirroringRedundancyDegraded
  - MirroringRedundancyFull
  - `PCIeCorrectable*`, `PCIeFatal` : 29 different events
  - SELEntryAdded
  - SparingRedundancyDegraded
- pfr-manager
  - BIOSFirmwareRecoveryReason
  - BIOSFirmwarePanicReason
  - BMCFirmwarePanicReason
  - BMCFirmwareRecoveryReason
  - BMCFirmwareResiliencyError
  - CPLDFirmwarePanicReason
  - CPLDFirmwareResilencyError
  - FirmwareResiliencyError
- host-error-monitor
  - CPUError
  - CPUMismatch
  - CPUThermalTrip
  - ComponentOverTemperature
  - SsbThermalTrip
  - VoltageRegulatorOverheated
- s2600wf-misc
  - DriveError
  - InventoryAdded

## Impacts

- New APIs are defined for error and event logging. This will deprecate the existing `phosphor-logging` APIs for error reporting, with a period allowed for migration.
- The design should improve performance by eliminating the regular parsing of the `systemd` journal. The design may decrease performance by allowing the number of error and event logs to be dramatically increased, which has an impact on file system utilization and the potential for DBus impacts to some services such as `ObjectMapper`.
- Backwards compatibility and documentation should be improved by the automatic generation of the Redfish Message Registry corresponding to all error and event reports.

### Organizational

- **Does this proposal require a new repository?**
  - No
- **Who will be the initial maintainer(s) of this repository?**
  - N/A
- **Which repositories are expected to be modified to execute this design?**
  - `sdbusplus`
  - `phosphor-dbus-interfaces`
  - `phosphor-logging`
  - `bmcweb`
  - Any repository creating an error or event.

## Testing

- Unit tests will be written in `sdbusplus` and `phosphor-logging` to cover error and event generation and the creation APIs, and to provide coverage of any changes to the `Logging.Entry` object management.
- Unit tests will be written for `bmcweb` for basic `Logging.Entry` transformation and Message Registry generation.
- Integration tests should be leveraged (and enhanced as necessary) from `openbmc-test-automation` to cover the end-to-end error creation and Redfish reporting.