# BIOS->BMC SMM Error Logging Queue Daemon

Author:

- Brandon Kim / brandonkim@google.com / @brandonk

Other contributors:

- Marco Cruz-Heredia / mcruzheredia@google.com

Created: Mar 15, 2022

## Problem Description

We've identified use cases where the BIOS enters System Management Mode (SMM)
to provide error logs to the BMC. Because of the time constraint the BIOS is
under in SMM, these messages must be sent as quickly as possible, without a
handshake or ack back from the BMC. The goal of the proposed daemon is to
implement a circular buffer over a shared BIOS->BMC buffer that the BIOS can
write to in a fire-and-forget manner.

## Background and References

There are various ways of communicating between the BMC and the BIOS, but only
a few don't require a handshake and let the data persist in shared memory.
These are listed in the "Alternatives Considered" section.

Different BMC vendors support different methods such as Shared Memory (SHM, via
LPC / eSPI) and P2A or PCI Mailbox, but the existing daemons that utilize them
do so over IPMI blob to communicate where and how much data has been transferred
(see [phosphor-ipmi-flash](https://github.com/openbmc/phosphor-ipmi-flash) and
[libmctp/astlpc](https://github.com/openbmc/libmctp/blob/master/docs/bindings/vendor-ibm-astlpc.md)).

## Requirements

The fundamental requirements for this daemon are listed as follows:

1. The BMC shall initialize the shared buffer in a way that the BIOS can
   recognize when it can write to the buffer
2. After initialization, the BIOS shall not have to wait for an ack back from
   the BMC before any writes to the shared buffer (**no synchronization**)
3. The BIOS shall be the main writer to the shared buffer, with the BMC mainly
   reading the payloads, only writing to the buffer to update the header
4. The BMC shall read new payloads from the shared buffer for further processing
5. The BIOS must be able to write a payload (~1KB) to the buffer within 50µs

The shared buffer will be as big as the protocol allows for a given BMC platform
(for example, 16KB for Nuvoton's PCI Mailbox on NPCM 7xx), and each payload is
estimated to be less than 1KB.

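As a rough sizing sketch in Python, the snippet below works out how much of a
16KB buffer is left for normal error logs and what sustained write rate
requirement 5 implies. The 0x30 header size comes from the header table under
"Proposed Design"; the 1KB UE region is purely an assumption for illustration,
since that size is still TBD in this design.

```python
HEADER_SIZE = 0x30          # fixed header size, from the header table
BUFFER_SIZE = 16 * 1024     # NPCM 7xx PCI Mailbox example from the text
PAYLOAD_SIZE = 1024         # upper estimate for one payload
ASSUMED_UE_REGION = 1024    # hypothetical; the real size is TBD


def log_region_size(buffer_size, ue_region_size):
    """Bytes left for normal error logs after the header and UE region."""
    return buffer_size - HEADER_SIZE - ue_region_size


def min_write_rate(payload_bytes, deadline_us):
    """Bytes per second needed to write one payload within the deadline."""
    return payload_bytes * 1_000_000 // deadline_us


region = log_region_size(BUFFER_SIZE, ASSUMED_UE_REGION)
print(region, region // PAYLOAD_SIZE)    # 15312 bytes, 14 full-size payloads
print(min_write_rate(PAYLOAD_SIZE, 50))  # 20480000 bytes/s
```
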
This daemon assumes that no other traffic will communicate through the given
protocol. The circular buffer and its header will provide some protection
against corruption, but this protection should not be relied upon.

## Proposed Design

The implementation of interfacing with the shared buffer will very closely
follow [phosphor-ipmi-flash](https://github.com/openbmc/phosphor-ipmi-flash). In
the future, it may be wise to extract the PCI Mailbox, P2A and LPC handling into
separate libraries shared between `phosphor-ipmi-flash` and this daemon to
reduce duplication of code.

Taken from Marco's (mcruzheredia@google.com) internal design document for the
circular buffer, the data structure of its header will look like the following:

| Name | Size | Offset | Written by | Description |
| --- | --- | --- | --- | --- |
| BMC Interface Version | 4 bytes | 0x0 | BMC at init | Allows the BIOS to determine if it is compatible with the BMC |
| BIOS Interface Version | 4 bytes | 0x4 | BIOS at init | Allows the BMC to determine if it is compatible with the BIOS |
| Magic Number | 16 bytes | 0x8 | BMC at init | Magic number to set the state of the queue as described below. Written by BMC once the memory region is ready to be used. Must be checked by BIOS before logging. BMC can change this number when it suspects data corruption to prevent BIOS from writing anything during reinitialization |
| Queue size | 3 bytes | 0x18 | BMC at init | Indicates the size of the region allocated for the circular queue. Written by BMC on init only, should not change during runtime. **This includes the size of the header and UE region size** |
| Uncorrectable Error region size | 2 bytes | 0x1b | BMC at init | Indicates the size of the region reserved for Uncorrectable Error (UE) logs. Written by BMC on init only, should not change during runtime |
| BMC flags | 4 bytes | 0x1d | BMC | <ul><li>BIT0 - BMC UE reserved region “switch”<ul><li>Toggled when BMC reads a UE from the reserved region.</li></ul></li><li>BIT1 - Overflow<ul><li>Lets BIOS know BMC has seen the overflow incident</li><li>Toggled when BMC acks the overflow incident</li></ul></li><li>BIT2 - BMC_READY<ul><li>BMC sets this bit once it has received any initialization information it needs to get from the BIOS before it’s ready to receive logs.</li></ul></li></ul> |
| BMC read pointer | 3 bytes | 0x21 | BMC | Used to allow the BIOS to detect when the BMC was unable to read the previous error logs in time to prevent the circular buffer from overflowing. |
| Padding | 4 bytes | 0x24 | Reserved | Padding for 8 byte alignment |
| BIOS flags | 4 bytes | 0x28 | BIOS | <ul><li>BIT0 - BIOS UE reserved region “switch”<ul><li>Toggled when BIOS writes a UE to the reserved region.</li></ul></li><li>BIT1 - Overflow<ul><li>Lets the BMC know that it missed an error log</li><li>Toggled when BIOS sees overflow and not already overflowed</li></ul></li><li>BIT2 - Incomplete Initialization<ul><li>Set when BIOS has attempted to initialize but did not see BMC ack back with `BMC_READY` bit in BMC flags</li></ul></li></ul> |
| BIOS write pointer | 3 bytes | 0x2c | BIOS | Indicates where the next log will be written by BIOS. Used to tell BMC when it should read a new log |
| Padding | 1 byte | 0x2f | Reserved | Padding for 8 byte alignment |
| Uncorrectable Error reserved region | TBD1 | 0x30 | BIOS | Reserved region only for UE logs. This region is only used if the rest of the buffer is going to overflow and there is no unread UE log already in the region. |
| Error Logs from BIOS | Size of the Buffer - 0x30 - TBD1 | 0x30 + TBD1 | BIOS | Logs vary by type, so each log will self-describe with a header. This region will fill up the rest of the buffer |

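One way to sanity-check the offsets in the table is to derive them from the
field sizes. The Python sketch below does that and decodes a header from raw
bytes. The field names and the little-endian byte order are assumptions made
for illustration, as this document does not specify either.

```python
# (name, size in bytes), in table order. Names are illustrative only.
HEADER_FIELDS = [
    ("bmc_interface_version", 4),
    ("bios_interface_version", 4),
    ("magic_number", 16),
    ("queue_size", 3),
    ("ue_region_size", 2),
    ("bmc_flags", 4),
    ("bmc_read_ptr", 3),
    ("padding_1", 4),
    ("bios_flags", 4),
    ("bios_write_ptr", 3),
    ("padding_2", 1),
]


def field_offsets(fields):
    """Return ({name: (offset, size)}, total_size) for back-to-back fields."""
    offsets, cursor = {}, 0
    for name, size in fields:
        offsets[name] = (cursor, size)
        cursor += size
    return offsets, cursor


OFFSETS, HEADER_SIZE = field_offsets(HEADER_FIELDS)


def parse_header(buf):
    """Decode all header fields from the first HEADER_SIZE bytes of buf.

    Assumption: integer fields are little-endian. The magic number and
    padding are returned as raw bytes.
    """
    header = {}
    for name, (offset, size) in OFFSETS.items():
        raw = bytes(buf[offset:offset + size])
        if name == "magic_number" or name.startswith("padding"):
            header[name] = raw
        else:
            header[name] = int.from_bytes(raw, "little")
    return header
```

With these sizes the derived header length is 0x30, which matches the offset
of the Uncorrectable Error reserved region in the table.
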
### Initialization

This daemon will first initialize the shared buffer by zeroing the whole
buffer, then initialize the header's other `BMC at init` fields, and write the
`Magic Number` last. Once the `Magic Number` is written, the BIOS will assume
that the shared buffer has been properly initialized, and will be able to start
writing entries to it.

If any further initialization between the BIOS and the BMC is required, the BMC
needs to set the `BMC_READY` bit in the BMC flags once that initialization
completes. If the BIOS does not see the flag being set, the BIOS shall set the
`Incomplete Initialization` flag to notify the BMC to reinitialize the buffer.

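The ordering described above can be sketched in Python as follows. This is a
minimal illustration and not the daemon's implementation: the function names,
the little-endian encoding, and the choice to store the full shared-buffer
length as `Queue size` are assumptions, and any magic value passed in by a
caller is a placeholder.

```python
BMC_READY = 1 << 2  # BIT2 of the BMC flags


def initialize_buffer(buf, bmc_version, ue_region_size, magic):
    """Zero the buffer, fill the 'BMC at init' fields, write the magic last.

    buf is a writable bytes-like object covering the whole shared buffer.
    Offsets are taken from the header table; byte order is an assumption.
    """
    if len(magic) != 16:
        raise ValueError("magic number must be exactly 16 bytes")
    # Step 1: clear everything, including any stale magic number.
    buf[:] = bytes(len(buf))
    # Step 2: fill in the remaining 'BMC at init' fields.
    buf[0x00:0x04] = bmc_version.to_bytes(4, "little")
    buf[0x18:0x1B] = len(buf).to_bytes(3, "little")
    buf[0x1B:0x1D] = ue_region_size.to_bytes(2, "little")
    # Step 3: publish the magic number last so the BIOS never sees a
    # valid magic next to a half-initialized header.
    buf[0x08:0x18] = magic


def set_bmc_ready(buf):
    """Set BMC_READY in the BMC flags once any extra init has completed."""
    flags = int.from_bytes(buf[0x1D:0x21], "little") | BMC_READY
    buf[0x1D:0x21] = flags.to_bytes(4, "little")
```
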
### Reading and Processing

This daemon will poll the buffer at a set interval (the exact interval will be
configurable, as the processing time and performance of different platforms may
require different polling rates). Once a new payload is detected, it will be
processed by a library that can also be chosen and configured at compile-time.

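Detecting and draining new data comes down to comparing the two pointers in the
header. The Python sketch below is an illustration under stated assumptions: it
treats both pointers as byte offsets relative to the start of the error log
region, and `read_unread` is a hypothetical helper name. Splitting the returned
bytes into individual logs is left to the per-log headers, which this document
does not define.

```python
def read_unread(queue, read_ptr, write_ptr):
    """Return (unread_bytes, new_read_ptr) for a circular byte queue.

    queue covers only the error log region. Equal pointers mean there is
    nothing new to read.
    """
    size = len(queue)
    if not (0 <= read_ptr < size and 0 <= write_ptr < size):
        raise ValueError("pointer outside the queue region")
    if write_ptr >= read_ptr:
        # Unread data is one contiguous run.
        data = bytes(queue[read_ptr:write_ptr])
    else:
        # The writer wrapped: read to the end, then from the start.
        data = bytes(queue[read_ptr:]) + bytes(queue[:write_ptr])
    return data, write_ptr
```

A polling loop would call this at the configured interval and then store the
returned pointer in the `BMC read pointer` field.
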
Note that the Uncorrectable Error logs have a reserved region as they contain
critical information that we don't want to lose, and should be prioritized over
normal error logs. This reserved region will be used to log a UE log only if an
overflow of the normal error log queue is imminent and the BMC has acked, using
BIT0 of the `BMC flags`, that any preexisting UE log in this region has already
been read.

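Because each side only toggles its own bit, one reading of the flag
descriptions is that an event is outstanding whenever the BIOS bit and the
matching BMC bit differ. The Python sketch below encodes that interpretation.
The helper names are hypothetical, and the comparison rule is inferred from the
header table; the table does not state it outright.

```python
UE_SWITCH = 1 << 0   # BIT0 in both the BIOS flags and the BMC flags
OVERFLOW = 1 << 1    # BIT1 in both the BIOS flags and the BMC flags


def pending(bios_flags, bmc_flags, bit):
    """True while the BIOS has toggled `bit` and the BMC has not matched it."""
    return bool((bios_flags ^ bmc_flags) & bit)


def acknowledge(bmc_flags, bit):
    """BMC side: toggle its copy of `bit` after handling the event."""
    return bmc_flags ^ bit


def bios_may_use_ue_region(bios_flags, bmc_flags, overflow_imminent):
    """BIOS side: use the reserved region only when the queue is about to
    overflow and no unread UE log is already sitting in the region."""
    return overflow_imminent and not pending(bios_flags, bmc_flags, UE_SWITCH)
```
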
An example of a processing library (and something we would like to push in our
initial version of this daemon) would be an RDE decoder for processing a subset
of Redfish Device Enablement (RDE) commands and decoding their attached Binary
Encoded JSON (BEJ) payloads.

## Alternatives Considered

- IPMI was considered, but it did not meet our speed requirement of writing a
  1KB entry in about 50 microseconds.
  - For reference, an initial PCI Mailbox performance measurement showed that a
    1KB entry write took roughly 10 microseconds.
- LPC / eSPI was also considered, but our BMC's SHM buffer was limited to 4KB,
  which was not enough for our use case.
- `libmctp` and MCTP PCIe VDM were considered.
  - `libmctp`'s current implementation relies on LPC as the transport binding
    and IPMI KCS for synchronization. LPC, as discussed, does not fit our
    current need, and synchronization does not work for us.
  - We may use MCTP PCIe VDM on our future platforms once we have more resources
    with expertise on both the BMC and the BIOS side, which we lack for our
    current project timeline.

## Impacts

Reading from the buffer and processing it may hinder performance of the BMC,
especially if the polling rate is set too high.

### Organizational

This design will require 2 repositories:

- bios-bmc-smm-error-logger
  - This repository will implement the daemon described in this document
  - Proposed maintainers: wltu@google.com, brandonkim@google.com
- libbej
  - This repository will follow the PLDM RDE specification as much as possible
    for RDE BEJ decoding (initially, encoding may come in the future) and will
    host a library written in C
  - Proposed maintainers: wltu@google.com, brandonkim@google.com

## Testing

Unit tests will cover each part of the daemon, mainly:

- Initialization
- Circular buffer processing
- Decoding / Processing library