### ECC Error SEL for BMC

Author: Will Liang

Primary assignee: Will Liang

Created: 2019-02-26

#### Problem Description

The IPMI SEL definitions only cover memory Error Correction Code (ECC) errors
for host memory, not for BMC memory.

This proposal aims to record ECC events from the BMC in the IPMI System Event
Log (SEL). Whenever an ECC error occurs, the BMC generates an event with the
appropriate information and adds it to the SEL.

#### Background and References

The IPMI specification defines memory system event log entries for ECC/other
correctable errors, ECC/other uncorrectable errors, and for reaching the
ECC/other correctable memory error logging limit [1]. The BMC ECC SEL will
follow the IPMI SEL format and create BMC memory ECC event logs.

OpenBMC currently supports generating SEL entries by parsing the D-Bus event
log, but it does not yet support a BMC ECC SEL. Therefore, the memory ECC
information will be registered on D-Bus, and memory ECC SEL entries will be
generated from it.

[[1] Intelligent Platform Management Interface Specification v2.0 rev 1.1, section 41](https://www.intel.com/content/www/us/en/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html)

#### Requirements

Currently, the OpenBMC project does not support ECC event logs in D-Bus because
there is no relevant ECC information in the OpenBMC D-Bus architecture. New ECC
D-Bus information will be added to the OpenBMC project, and an ECC monitor
service will be created to fetch the ECC counts (`ce_count`/`ue_count`) from
the EDAC driver. The EDAC driver must be loaded so that the ECC/other
correctable and ECC/other uncorrectable counts can be obtained from it.
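
The requirement above can be illustrated with a minimal sketch of how a monitor might fetch `ce_count` and `ue_count`. It assumes the conventional Linux EDAC sysfs layout; the controller index `mc0` and the function name `read_ecc_counts` are assumptions for illustration, not part of this proposal.

```python
from pathlib import Path
from typing import Optional, Tuple

# Conventional Linux EDAC sysfs location. The controller index (mc0) is an
# assumption and may differ between platforms.
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc/mc0")


def read_ecc_counts(mc_dir: Path = EDAC_MC_DIR) -> Optional[Tuple[int, int]]:
    """Return (ce_count, ue_count), or None if the counters are unavailable."""
    try:
        ce_count = int((mc_dir / "ce_count").read_text().strip())
        ue_count = int((mc_dir / "ue_count").read_text().strip())
    except (OSError, ValueError):
        # Missing or unreadable files usually mean the EDAC driver is not loaded.
        return None
    return ce_count, ue_count


if __name__ == "__main__":
    counts = read_ecc_counts()
    if counts is None:
        print("EDAC counters not available; is the EDAC driver loaded?")
    else:
        print(f"ce_count={counts[0]} ue_count={counts[1]}")
```

Returning `None` instead of raising lets the caller treat "driver not loaded" as a normal, reportable condition rather than a crash.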

#### Proposed Design

ECC-enabled memory controllers can detect and correct memory errors on
operating systems that support it (such as certain versions of Linux, macOS,
and Windows), which helps identify problems before they become catastrophic
memory module failures.

Many ECC memory systems use an "external" EDAC between the CPU and the memory
to fix memory errors. Most hosts integrate EDAC into the CPU's integrated
memory controller.

According to Section 42.2 of the IPMI specification, Table 42 [2], the SEL
sensor type will be defined as `Memory`, and the `Event Data 3` field can be
used to provide an event extension. Therefore, the BMC ECC event sets
`Event Data 3` to the value FEh to identify a BMC ECC error.

[[2] Intelligent Platform Management Interface Specification v2.0 rev 1.1, section 42.2](https://www.intel.com/content/www/us/en/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html)

The main purpose of this function is to give the BMC the ability to record ECC
error SELs.

This design adds two new applications:

- poll the ECC error count
- create the ECC SEL

The design also includes a mechanism that limits the number of logs, to avoid
creating a large number of correctable ECC logs. When the `maximum quantity` is
reached, the ECC service stops recording ECC logs. The `maximum quantity`
(default: 100) is saved in the configuration file, and the user can modify the
value if necessary.

##### phosphor-ecc.service

This service runs continuously and reads the ECC error counts once every second
after it is started. On first start, it resets all correctable ECC counts and
uncorrectable ECC counts in the EDAC driver.
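
The polling and log-limit behaviour described so far can be sketched as a small state machine. This is only an illustration under stated assumptions: the class name `EccMonitor`, the event strings, and the choice to suppress only correctable logs once the limit is reached are hypothetical and not confirmed by this proposal. A real service would call `poll` once per second with freshly read counts.

```python
from dataclasses import dataclass
from typing import List

DEFAULT_MAX_CE_LOGS = 100  # default `maximum quantity` named in the proposal


@dataclass
class EccMonitor:
    """Tracks EDAC counts between polls and decides which logs to record."""

    max_ce_logs: int = DEFAULT_MAX_CE_LOGS
    prev_ce: int = 0  # starts at 0 because the service resets the counts
    prev_ue: int = 0
    ce_logs: int = 0
    limit_reached: bool = False

    def poll(self, ce_count: int, ue_count: int) -> List[str]:
        """Compare fresh counts with the previous ones; return logs to record."""
        events: List[str] = []
        # Assumption: once the limit is hit, only correctable logs are dropped.
        if ce_count > self.prev_ce and not self.limit_reached:
            events.append("correctable ECC")
            self.ce_logs += 1
            if self.ce_logs >= self.max_ce_logs:
                events.append("logging limit reached")
                self.limit_reached = True
        if ue_count > self.prev_ue:
            events.append("uncorrectable ECC")
        self.prev_ce, self.prev_ue = ce_count, ue_count
        return events


if __name__ == "__main__":
    monitor = EccMonitor(max_ce_logs=2)
    for ce, ue in [(0, 0), (1, 0), (2, 1), (3, 1)]:
        print(ce, ue, monitor.poll(ce, ue))
```

Keeping the decision logic free of I/O makes it easy to exercise with synthetic count sequences before any EDAC hardware is involved.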

It also provides the following path on D-Bus:

- bus name: `xyz.openbmc_project.Memory.ECC`
- object path: `/xyz/openbmc_project/metrics/memory/BmcECC`
- interface: `xyz.openbmc_project.Memory.MemoryECC`

The interface has the following properties:

| Property | Type | Description |
| -------- | ---- | ----------- |
| isLoggingLimitReached | bool | ECC logging limit reached |
| ceCount | int64 | correctable ECC events |
| ueCount | int64 | uncorrectable ECC events |
| state | string | BMC ECC event state |

The error types `xyz::openbmc_project::Memory::Ecc::Error::ceCount`, `ueCount`,
and `isLoggingLimitReached` will be created; they provide the error types for
the ECC logs.

##### Create the ECC SEL

Use the `phosphor-sel-logger` package to record the following logs in BMC SEL
format.

- correctable ECC log: when the `ce_count` fetched from the EDAC driver exceeds
  the previous count.
- uncorrectable ECC log: when the `ue_count` fetched from the EDAC driver
  exceeds the previous count.
- logging limit reached log: when the number of correctable ECC logs reaches
  the `maximum quantity`.

#### Alternatives Considered

An alternative is to never stop recording ECC logs: whenever `ce_count` is
checked and its value exceeds the previous value, an ECC log is recorded.
However, this could produce a very large number of ECC logs, which would also
consume BMC memory.

#### Impacts

This implementation only requires changes to how event logs are created, so it
has minimal impact on the rest of the system.

#### Testing

Depending on the platform hardware design, this test requires an ECC driver
that can inject fake ECC errors, followed by a check that the resulting
scenario behaves as expected.
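
When checking test results, it helps to know which fields identify a BMC ECC entry in the SEL. The sketch below maps the three log types from "Create the ECC SEL" to the Memory sensor type code (0Ch) and the sensor-specific offsets from the IPMI v2.0 specification, together with the `Event Data 3` value FEh proposed here. The dictionary layout and function names are hypothetical, and how `phosphor-sel-logger` actually receives these values is not shown.

```python
SENSOR_TYPE_MEMORY = 0x0C   # IPMI sensor type code for Memory
BMC_ECC_EVENT_DATA3 = 0xFE  # value proposed in this design to mark BMC ECC

# Sensor-specific event offsets for the Memory sensor type (IPMI v2.0).
EVENT_OFFSETS = {
    "correctable ECC": 0x00,
    "uncorrectable ECC": 0x01,
    "logging limit reached": 0x05,
}


def describe_bmc_ecc_event(log_type: str) -> dict:
    """Return the identifying fields for one BMC ECC SEL entry."""
    return {
        "sensor_type": SENSOR_TYPE_MEMORY,
        "event_offset": EVENT_OFFSETS[log_type],
        "event_data3": BMC_ECC_EVENT_DATA3,
    }


def is_bmc_ecc_event(sensor_type: int, event_data3: int) -> bool:
    """Tell BMC ECC entries apart from host memory ECC entries."""
    return (sensor_type == SENSOR_TYPE_MEMORY
            and event_data3 == BMC_ECC_EVENT_DATA3)


if __name__ == "__main__":
    for name in EVENT_OFFSETS:
        print(name, describe_bmc_ecc_event(name))
```

A test script could apply `is_bmc_ecc_event` to each decoded SEL record after injecting fake errors, to separate BMC entries from any host memory entries.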