### ECC Error SEL for BMC

Author: Will Liang

Created: 2019-02-26

#### Problem Description

The IPMI specification defines memory Error Correction Code (ECC) SEL events
only for host memory, not for BMC memory.

The aim of this proposal is to record BMC ECC events in the IPMI System Event
Log (SEL). Whenever an ECC error occurs, the BMC generates an event with the
appropriate information and adds it to the SEL.

#### Background and References

The IPMI specification defines memory SEL events for ECC/other correctable
errors, for ECC/other uncorrectable errors, and for reaching the ECC/other
correctable memory error logging limit [1]. The BMC ECC SEL follows the IPMI SEL
format when creating BMC memory ECC event logs.

OpenBMC currently supports generating SEL entries by parsing the D-Bus event
log, but it does not yet support BMC ECC SEL entries. Therefore, the memory ECC
information will be registered on D-Bus and used to generate memory ECC SEL
entries as well.

[[1] Intelligent Platform Management Interface Specification v2.0 rev 1.1, section 41](https://www.intel.com/content/www/us/en/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html)

#### Requirements

Currently, the OpenBMC project does not support ECC event logs on D-Bus because
the OpenBMC D-Bus architecture carries no ECC information. New ECC D-Bus
information will be added to the OpenBMC project, and an ECC monitor service
will be created to fetch the ECC counts (`ce_count`/`ue_count`) from the EDAC
driver. The EDAC driver must therefore be loaded, and it must expose the
ECC/other correctable and ECC/other uncorrectable counts.

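As a rough illustration of fetching the counts, the sketch below reads
`ce_count` and `ue_count` from the Linux EDAC sysfs tree. The `mc0` controller
path and the function name `read_edac_counts` are assumptions made for this
example, not part of the proposal.

```python
from pathlib import Path

# Assumed location of the first memory controller in the EDAC sysfs tree.
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc/mc0")


def read_edac_counts(mc_dir=EDAC_MC_DIR):
    """Return (ce_count, ue_count) read from one EDAC controller directory."""
    mc_dir = Path(mc_dir)
    ce_count = int((mc_dir / "ce_count").read_text().strip())
    ue_count = int((mc_dir / "ue_count").read_text().strip())
    return ce_count, ue_count
```

If the EDAC driver is not loaded, the files do not exist and the call raises
`FileNotFoundError`, which is one way a monitor service could detect the
unmet requirement.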
#### Proposed Design

ECC-enabled memory controllers can detect and correct memory errors. Operating
systems that support this (such as certain versions of Linux, macOS, and
Windows) can report those errors, which helps identify a faulty memory module
before the problem becomes catastrophic.

Many ECC memory systems use an "external" EDAC circuit between the CPU and the
memory to fix memory errors. Most hosts integrate EDAC into the CPU's integrated
memory controller.

According to Table 42 in section 42.2 of the IPMI specification [2], the sensor
type of these SEL entries is `Memory`, and the `Event Data 3` field can be used
to provide an event extension. Therefore, the BMC ECC event sets `Event Data 3`
to FEh to identify a BMC ECC error.

[[2] Intelligent Platform Management Interface Specification v2.0 rev 1.1, section 42.2](https://www.intel.com/content/www/us/en/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html)

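The byte layout described above can be sketched as follows. The sensor type
code and the sensor-specific offsets come from the IPMI `Memory` sensor type.
The flag bits in `Event Data 1`, the FFh filler for `Event Data 2`, and the
helper name `bmc_ecc_event_data` are assumptions for illustration only.

```python
# Sensor type code for Memory in the IPMI sensor type table.
SENSOR_TYPE_MEMORY = 0x0C

# Sensor-specific offsets for the Memory sensor type.
OFFSET_CORRECTABLE_ECC = 0x00
OFFSET_UNCORRECTABLE_ECC = 0x01
OFFSET_LOGGING_LIMIT_REACHED = 0x05

# From this design: Event Data 3 = FEh identifies a BMC ECC error.
BMC_ECC_MARKER = 0xFE

# Assumption: mark Event Data 3 as an OEM code (bits [5:4] = 10b) and leave
# Event Data 2 unspecified (bits [7:6] = 00b, byte value FFh).
EVENT_DATA3_OEM_FLAG = 0x20
UNSPECIFIED = 0xFF


def bmc_ecc_event_data(offset):
    """Build the three event data bytes for a BMC ECC SEL entry."""
    if not 0 <= offset <= 0x0F:
        raise ValueError("offset must fit in the low nibble of Event Data 1")
    return bytes([EVENT_DATA3_OEM_FLAG | offset, UNSPECIFIED, BMC_ECC_MARKER])
```

For example, `bmc_ecc_event_data(OFFSET_UNCORRECTABLE_ECC)` yields the bytes
21h, FFh, FEh under these assumptions.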
The main purpose of this function is to provide the BMC with the ability to
record ECC error SELs.

This design introduces two new pieces of functionality:

- poll the ECC error count
- create the ECC SEL

The design also includes a mechanism that limits the maximum number of logs, to
avoid creating a large number of correctable ECC logs. When the
`maximum quantity` is reached, the ECC service stops recording ECC logs. The
`maximum quantity` (default: 100) is saved in the configuration file, and the
user can modify the value if necessary.

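A minimal sketch of such a limit is shown below. The class name
`EccLogLimiter` and its methods are hypothetical, and loading the value from
the configuration file is left out because the file format is not specified
here.

```python
DEFAULT_MAX_ECC_LOGS = 100  # default "maximum quantity" stated in this design


class EccLogLimiter:
    """Count correctable ECC logs and refuse new ones once the limit is hit."""

    def __init__(self, max_logs=DEFAULT_MAX_ECC_LOGS):
        self.max_logs = max_logs
        self.logged = 0

    @property
    def limit_reached(self):
        return self.logged >= self.max_logs

    def try_log(self):
        """Return True and count the log if one more log may be recorded."""
        if self.limit_reached:
            return False
        self.logged += 1
        return True
```

A caller would record an ECC log only when `try_log()` returns `True`, and
could check `limit_reached` afterwards to decide whether to report that the
limit has been hit.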
##### phosphor-ecc.service

This service keeps the application running and reads the ECC error counts every
second once it has started. On first start, it resets all correctable and
uncorrectable ECC counts in the EDAC driver.

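The reset-then-poll behaviour might look roughly like the sketch below. It
assumes the EDAC `reset_counters` sysfs control file and the `mc0` controller
path; the function names and the `iterations` and `sleep` parameters are
hypothetical and exist only to keep the example testable.

```python
import time
from pathlib import Path

# Assumed location of the first memory controller in the EDAC sysfs tree.
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc/mc0")


def reset_edac_counters(mc_dir=EDAC_MC_DIR):
    """Ask the EDAC driver to zero its counters via reset_counters."""
    (Path(mc_dir) / "reset_counters").write_text("1")


def poll_counts(read_counts, on_counts, interval=1.0, iterations=None,
                sleep=time.sleep):
    """Call read_counts() every `interval` seconds and hand off the result.

    read_counts returns (ce_count, ue_count); on_counts receives both values.
    iterations=None polls forever, as a long-running service would.
    """
    done = 0
    while iterations is None or done < iterations:
        on_counts(*read_counts())
        done += 1
        sleep(interval)
```

A service built this way would call `reset_edac_counters()` once at startup and
then `poll_counts(...)` with a reader for the EDAC counts and a callback that
updates D-Bus and creates logs.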
It also provides the following path on D-Bus:

- bus name : `xyz.openbmc_project.Memory.ECC`
- object path : `/xyz/openbmc_project/metrics/memory/BmcECC`
- interface : `xyz.openbmc_project.Memory.MemoryECC`

The interface has the following properties:

| Property              | Type   | Description               |
| --------------------- | ------ | ------------------------- |
| isLoggingLimitReached | bool   | ECC logging limit reached |
| ceCount               | int64  | correctable ECC events    |
| ueCount               | int64  | uncorrectable ECC events  |
| state                 | string | BMC ECC event state       |

The error types `ceCount`, `ueCount`, and `isLoggingLimitReached` will be
created under `xyz::openbmc_project::Memory::Ecc::Error`. They are the error
types used for the ECC logs.

##### Create the ECC SEL

Use the `phosphor-sel-logger` package to record the following logs in BMC SEL
format.

- correctable ECC log : when the `ce_count` fetched from the EDAC driver exceeds
  the previous count.
- uncorrectable ECC log : when the `ue_count` fetched from the EDAC driver
  exceeds the previous count.
- logging limit reached log : when the number of correctable ECC logs reaches
  the `maximum quantity`.

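The three conditions above can be sketched as a single decision function. The
function name, the log labels, and the choice to leave uncorrectable logs
unlimited are assumptions for illustration; actually writing the entries
through `phosphor-sel-logger` is not shown.

```python
def ecc_logs_to_record(prev, curr, logged_ce, max_logs=100):
    """Return (log names to record, updated correctable-log count).

    prev and curr are (ce_count, ue_count) tuples from consecutive polls.
    logged_ce is the number of correctable ECC logs recorded so far.
    """
    logs = []
    prev_ce, prev_ue = prev
    curr_ce, curr_ue = curr
    # Correctable log: count grew and the limit has not been reached yet.
    if curr_ce > prev_ce and logged_ce < max_logs:
        logs.append("correctable")
        logged_ce += 1
        # Emit the limit log once, on the log that reaches the limit.
        if logged_ce == max_logs:
            logs.append("logging_limit_reached")
    # Assumption: uncorrectable logs are not subject to the limit.
    if curr_ue > prev_ue:
        logs.append("uncorrectable")
    return logs, logged_ce
```

With `max_logs=2`, a poll that moves the counts from `(1, 0)` to `(2, 1)` after
one earlier correctable log returns the correctable, limit-reached, and
uncorrectable labels together, and later correctable increases return nothing.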
#### Alternatives Considered

Another option is to never stop recording ECC logs. Whenever the service checks
`ce_count` and the value exceeds the previous value, it records an ECC log.
However, this can produce a large number of ECC logs, which also occupy BMC
memory.

#### Impacts

The implementation only changes how event logs are created, so it has minimal
impact on the rest of the system.

#### Testing

Depending on the platform hardware design, testing requires an ECC driver that
can generate fake ECC errors, followed by a check that the scenario behaves as
expected.
120