.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================

RAS concepts
************

Reliability, Availability and Serviceability (RAS) is a concept used on
servers to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time.

  * Generally measured as a percentage of downtime over a period of time
  * Often uses mechanisms to detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained.

  * Generally measured as Mean Time To Repair (MTTR)

Improving RAS
-------------

In order to reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime. It should
also provide mechanisms to detect hardware degradation, in order to warn
the system administrator to replace a component before
it causes data loss or system downtime.

Among the monitoring measures, the most usual ones include:

* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
* Memory – add error correction logic (ECC) to detect and correct errors;
* I/O – add CRC checksums for transferred data;
* Storage – RAID, journal file systems, checksums,
  Self-Monitoring, Analysis and Reporting Technology (SMART).

By monitoring the number of occurrences of error detections, it is possible
to identify if the probability of hardware errors is increasing and, in such
a case, perform preventive maintenance to replace a degraded component while
those errors are still correctable.

Types of errors
---------------

Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors on a bit packet
is below a threshold.
If the number of errors is above that threshold, those mechanisms
can indicate with a high degree of confidence that an error happened, but
they can't correct it.

Also, sometimes an error occurs on a component that is not in use. For
example, a part of the memory that is not currently allocated.

That defines some categories of errors:

* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors exceeded the error
  correction threshold, and the system was unable to auto-correct.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, eventually replacing the affected hardware by a hot spare,
  if available.

  Also, when an error happens on a userspace process, it is also possible to
  kill such process and let userspace restart it.

The mechanism for handling non-fatal errors is usually complex and may
require the help of some userspace application, in order to apply the
policy desired by the system administrator.

Identifying a bad hardware component
------------------------------------

Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
will translate the error message to the silkscreen or component label for
the MRU.

Typically, it is very complex for memory, as modern CPUs interleave memory
from different memory modules, in order to provide a better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool.
For example, on a desktop machine, it shows::

        Memory Device
                Total Width: 64 bits
                Data Width: 64 bits
                Size: 16384 MB
                Form Factor: SODIMM
                Set: None
                Locator: ChannelA-DIMM0
                Bank Locator: BANK 0
                Type: DDR4
                Type Detail: Synchronous
                Speed: 2133 MHz
                Rank: 2
                Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is located at the
system's memory labeled as "BANK 0", as given by the *bank locator* field.
Please notice that, on such a system, the *total width* is equal to the
*data width*. It means that such a memory module doesn't have error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank.
In this example, from an older server, ``dmidecode`` shows::

        Memory Device
                Array Handle: 0x1000
                Error Information Handle: Not Provided
                Total Width: 72 bits
                Data Width: 64 bits
                Size: 8192 MB
                Form Factor: DIMM
                Set: 1
                Locator: DIMM_A1
                Bank Locator: Not Specified
                Type: DDR3
                Type Detail: Synchronous Registered (Buffered)
                Speed: 1600 MHz
                Rank: 2
                Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is located at the system's memory labeled
as "DIMM_A1", as given by the *locator* field. Please notice that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
Such kind of memory is called Error-correcting code memory (ECC memory).

To make things even worse, it is not uncommon for systems with different
labels on their system boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.
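
The comparison of *total width* and *data width* described above can be
sketched in a few lines of Python. This is only an illustration, assuming
``dmidecode`` text shaped like the two samples in this section; the helper
names (``parse_memory_devices``, ``ecc_bits``, ``candidate_labels``) are
hypothetical and not part of any tool. Since systems disagree on which field
carries the label, the sketch returns both locator fields and leaves the
choice to the caller.

```python
import re

SAMPLE = """\
Memory Device
        Total Width: 64 bits
        Data Width: 64 bits
        Locator: ChannelA-DIMM0
        Bank Locator: BANK 0

Memory Device
        Total Width: 72 bits
        Data Width: 64 bits
        Locator: DIMM_A1
        Bank Locator: Not Specified
"""


def parse_memory_devices(text):
    """Return one dict of "Field: value" pairs per "Memory Device" block."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            current = None  # a blank line ends the current block
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def ecc_bits(device):
    """Extra bits per word (total width - data width), or None if unknown."""
    widths = []
    for field in ("Total Width", "Data Width"):
        match = re.match(r"(\d+) bits$", device.get(field, ""))
        if match is None:
            return None
        widths.append(int(match.group(1)))
    return widths[0] - widths[1]


def candidate_labels(device):
    """Return (locator, bank locator); "Not Specified" is mapped to None."""
    labels = []
    for field in ("Locator", "Bank Locator"):
        value = device.get(field)
        labels.append(None if value in (None, "Not Specified") else value)
    return tuple(labels)


for dev in parse_memory_devices(SAMPLE):
    print(candidate_labels(dev), "extra bits:", ecc_bits(dev))
```

A result of 0 extra bits corresponds to the desktop module without error
correction, and 8 extra bits to the ECC module of the server example.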

ECC memory
----------

As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. In the above example, a memory module has
64 bits of *data width*, and 72 bits of *total width*. The extra 8
bits which are used for the error detection and correction mechanisms
are referred to as the *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
on the memory modules.

On a read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened. If not, an
Uncorrected Error (UE) happened.
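
The write/read cycle above can be illustrated with a toy code. The sketch
below is an assumption-laden example, not what any real memory controller
implements: it uses an extended Hamming (8,4) code, that is, 4 data bits plus
4 check bits, as a small-scale analogue of the 64 data bits plus 8 extra bits
discussed in this section. The function names ``encode`` and ``decode`` are
made up for this illustration.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # positions 1, 2 and 4 hold Hamming check bits
CHECK_POSITIONS = (1, 2, 4)      # position 0 holds the overall parity bit


def _syndrome(code):
    """XOR of the positions (1..7) of all bits that are set."""
    value = 0
    for pos in range(1, 8):
        if code[pos]:
            value ^= pos
    return value


def encode(data):
    """Turn 4 data bits into an 8-bit codeword (the "total width")."""
    code = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data):
        code[pos] = bit
    syndrome = _syndrome(code)
    for pos in CHECK_POSITIONS:
        if syndrome & pos:
            code[pos] = 1        # makes the syndrome of a clean word zero
    code[0] = sum(code[1:]) % 2  # even parity over the whole codeword
    return code


def decode(code):
    """Return (data bits, status), with status "OK", "CE" or "UE"."""
    code = list(code)
    syndrome = _syndrome(code)
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "OK"
    elif parity_odd:
        # Odd number of flips: treat as a single-bit error and fix it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        status = "UE"
    # The data word is returned even when the error was not corrected.
    return [code[pos] for pos in DATA_POSITIONS], status


word = [1, 0, 1, 1]
stored = encode(word)
flipped = list(stored)
flipped[5] ^= 1                  # simulate a single-bit memory error
print(decode(stored), decode(flipped))
```

As in the description above, a single flipped bit is reported as a CE and the
original word is recovered, while two flipped bits are only detected and
reported as a UE.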

The information about the CE/UE errors is stored on some special registers
at the memory controller and can be accessed by reading such registers,
either by the BIOS, by some special CPUs or by the Linux EDAC driver. On x86
64-bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.

.. [#f1] Please notice that several memory controllers allow operation on a
  mode called "Lock-Step", where the controller groups two memory modules
  together, doing 128-bit reads/writes. That gives 16 bits for error
  correction, which significantly improves the error correction mechanism,
  at the expense that, when an error happens, there's no way to know what
  memory module is to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
  In such mode, the same data is written to two memory modules. On a read,
  the system checks both memory modules, in order to check if both provide
  identical data. In such configuration, when an error happens, there's no
  way to know what memory module is to blame. So, it has to blame both
  memory modules (or 4 memory modules, if the system is also on Lock-step
  mode).

.. [#f3] For more details about the Machine Check Architecture (MCA),
  please read Documentation/arch/x86/x86_64/machinecheck.rst at the Kernel tree.

EDAC - Error Detection And Correction
*************************************

.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was pushed upstream for the first time, on
   Kernel 2.6.16, it was renamed to ``EDAC``.

Purpose
-------

The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

Memory
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the ``edac_mc`` device.

Detecting CE events, then harvesting those events and reporting them,
**can**, but need not, be a predictor of future UE events. With
CE events only, the system can and will continue to operate as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
modules exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

Other hardware elements
-----------------------

A new feature for EDAC, the ``edac_device`` class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory type of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.


PCI bus scanning
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float" giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device. The attribute is::

        broken_parity_status

and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
PCI devices.
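
To see which devices the scanning code would skip, the attribute can be read
from userspace. The following is a minimal sketch, assuming the sysfs layout
given above; the function name ``devices_with_broken_parity`` is hypothetical,
and any value other than ``0`` is assumed to mean that the attribute is set.

```python
import os


def devices_with_broken_parity(sysfs_devices="/sys/devices"):
    """Return the PCI addresses whose broken_parity_status is set."""
    flagged = []
    for entry in sorted(os.listdir(sysfs_devices)):
        if not entry.startswith("pci"):
            continue
        top = os.path.join(sysfs_devices, entry)
        # os.walk() does not follow symlinks by default, so the symlinks
        # found in sysfs cannot send the walk into a loop.
        for dirpath, _dirnames, filenames in os.walk(top):
            if "broken_parity_status" not in filenames:
                continue
            try:
                with open(os.path.join(dirpath, "broken_parity_status")) as attr:
                    value = attr.read().strip()
            except OSError:
                continue  # attribute vanished or is not readable
            if value not in ("", "0"):
                flagged.append(os.path.basename(dirpath))
    return sorted(flagged)


# Typical use on a running system:
#     print(devices_with_broken_parity())
```

The walk also descends into devices that sit behind PCI bridges, since those
appear as nested directories under the root bus directory.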


Versioning
----------

EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
``edac_device`` driver) have individual versions that reflect the current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


Loading
-------

If ``edac`` was statically linked with the kernel then no loading
is necessary. If ``edac`` was built as modules then simply modprobe
the ``edac`` pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example::

        $ modprobe amd76x_edac

loads both the ``amd76x_edac.ko`` memory controller module and the
``edac_mc.ko`` core module.
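
Whether the modules ended up loaded can be checked from the module list in
``/proc/modules``, whose first field on each line is the module name. The
snippet below is a small sketch under that assumption; the helper name
``loaded_edac_modules`` and the sample text are made up for illustration.
When ``edac`` is statically linked with the kernel, nothing is listed there.

```python
def loaded_edac_modules(proc_modules_text):
    """Return the names of loaded modules whose name contains "edac"."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names


# Hypothetical excerpt in the /proc/modules format.
SAMPLE_MODULES = """\
amd76x_edac 16384 0 - Live 0x0000000000000000
edac_mc 49152 1 amd76x_edac, Live 0x0000000000000000
ext4 741376 2 - Live 0x0000000000000000
"""

print(loaded_edac_modules(SAMPLE_MODULES))

# Typical use on a running system:
#     with open("/proc/modules") as modules:
#         print(loaded_edac_modules(modules.read()))
```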


Sysfs interface
---------------

EDAC presents a ``sysfs`` interface for control and reporting purposes. It
lives in the ``/sys/devices/system/edac`` directory.

Within this directory there currently reside 2 components:

        ======= ==============================
        mc      memory controller(s) system
        pci     PCI control and status system
        ======= ==============================



Memory Controller (mc) Model
----------------------------

Each ``mc`` device controls a set of memory modules [#f4]_. These modules
are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
There can be multiple csrows and multiple channels.

.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
  used to refer to a memory module, although there are other memory
  packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
  specification (Version 2.7) defines a memory module in the Common
  Platform Error Record (CPER) section to be an SMBIOS Memory Device
  (Type 17). Throughout this document, and inside the EDAC subsystem, the
  term "dimm" is used for all memory modules, even when they use a
  different kind of packaging.

Memory controllers allow for several csrows, with 8 csrows being a
typical value. Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and memory module characteristics.

Dual channels allow for dual data length (e.g. 128 bits, on 64-bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers. The following example will assume 2 channels:

        +------------+-----------------------+
        | CS Rows    |       Channels        |
        +------------+-----------+-----------+
        |            |  ``ch0``  |  ``ch1``  |
        +============+===========+===========+
        |            |**DIMM_A0**|**DIMM_B0**|
        +------------+-----------+-----------+
        | ``csrow0`` |   rank0   |   rank0   |
        +------------+-----------+-----------+
        | ``csrow1`` |   rank1   |   rank1   |
        +------------+-----------+-----------+
        |            |**DIMM_A1**|**DIMM_B1**|
        +------------+-----------+-----------+
        | ``csrow2`` |   rank0   |   rank0   |
        +------------+-----------+-----------+
        | ``csrow3`` |   rank1   |   rank1   |
        +------------+-----------+-----------+

In the above example, there are 4 physical slots on the motherboard
for memory DIMMs:

        +---------+---------+
        | DIMM_A0 | DIMM_B0 |
        +---------+---------+
        | DIMM_A1 | DIMM_B1 |
        +---------+---------+

Labels for these slots are usually silk-screened on the motherboard.
Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
In the example above 2 dual ranked DIMMs are similarly placed. Thus,
both csrow0 and csrow1 are populated. On the other hand, when 2 single
ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will
have just one csrow (csrow0) and csrow1 will be empty. The pattern
repeats itself for csrow2 and csrow3.
Also note that some memory
controllers don't have any logic to identify the memory module, see
``rankX`` directories below.

The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
``/sys/devices/system/edac/mc``, each memory controller will be
represented by its own ``mcX`` directory, where ``X`` is the
index of the MC::

        ..../edac/mc/
                   |
                   |->mc0
                   |->mc1
                   |->mc2
                   ....

Under each ``mcX`` directory each csrow is represented by its own
``csrowX`` directory, where ``X`` is the csrow index::

        .../mc/mc0/
                |
                |->csrow0
                |->csrow2
                |->csrow3
                ....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single-ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.
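
The reasoning above, inferring single or dual rank from which ``csrowX``
directories exist, can be sketched as follows. This is an illustration only:
the function names are hypothetical, and the default of two csrows per DIMM
slot is taken from the example layout in this section, not from any general
rule.

```python
import os
import re


def populated_csrows(mc_dir):
    """Return the sorted csrow indexes found under one mcX directory."""
    rows = []
    for name in os.listdir(mc_dir):
        match = re.fullmatch(r"csrow(\d+)", name)
        if match and os.path.isdir(os.path.join(mc_dir, name)):
            rows.append(int(match.group(1)))
    return sorted(rows)


def ranks_per_slot_row(rows, csrows_per_dimm=2):
    """Map each row of DIMM slots to its number of populated csrows."""
    ranks = {}
    for row in rows:
        slot_row = row // csrows_per_dimm
        ranks[slot_row] = ranks.get(slot_row, 0) + 1
    return ranks


# With the listing shown above (csrow0, csrow2 and csrow3 present):
print(ranks_per_slot_row([0, 2, 3]))

# Typical use on a running system:
#     rows = populated_csrows("/sys/devices/system/edac/mc/mc0")
```

For the listing above, the first row of slots (``DIMM_A0``/``DIMM_B0``) maps
to one populated csrow and the second row (``DIMM_A1``/``DIMM_B1``) to two.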

Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.

``mcX`` directories
-------------------

In ``mcX`` directories are EDAC control and attribute files for
this ``X`` instance of the memory controllers.

For a description of the sysfs API, please see:

        Documentation/ABI/testing/sysfs-devices-edac


``dimmX`` or ``rankX`` directories
----------------------------------

The recommended way to use the EDAC subsystem is to look at the information
provided by the ``dimmX`` or ``rankX`` directories [#f5]_.
440fd77f6baSMauro Carvalho Chehab 441fd77f6baSMauro Carvalho ChehabA typical EDAC system has the following structure under 442fd77f6baSMauro Carvalho Chehab``/sys/devices/system/edac/``\ [#f6]_:: 443fd77f6baSMauro Carvalho Chehab 444fd77f6baSMauro Carvalho Chehab /sys/devices/system/edac/ 445fd77f6baSMauro Carvalho Chehab ├── mc 446fd77f6baSMauro Carvalho Chehab │ ├── mc0 447fd77f6baSMauro Carvalho Chehab │ │ ├── ce_count 448fd77f6baSMauro Carvalho Chehab │ │ ├── ce_noinfo_count 449fd77f6baSMauro Carvalho Chehab │ │ ├── dimm0 4504fb6fde7SAaron Miller │ │ │ ├── dimm_ce_count 451fd77f6baSMauro Carvalho Chehab │ │ │ ├── dimm_dev_type 452fd77f6baSMauro Carvalho Chehab │ │ │ ├── dimm_edac_mode 453fd77f6baSMauro Carvalho Chehab │ │ │ ├── dimm_label 454fd77f6baSMauro Carvalho Chehab │ │ │ ├── dimm_location 455fd77f6baSMauro Carvalho Chehab │ │ │ ├── dimm_mem_type 4564fb6fde7SAaron Miller │ │ │ ├── dimm_ue_count 457fd77f6baSMauro Carvalho Chehab │ │ │ ├── size 458fd77f6baSMauro Carvalho Chehab │ │ │ └── uevent 459fd77f6baSMauro Carvalho Chehab │ │ ├── max_location 460fd77f6baSMauro Carvalho Chehab │ │ ├── mc_name 461fd77f6baSMauro Carvalho Chehab │ │ ├── reset_counters 462fd77f6baSMauro Carvalho Chehab │ │ ├── seconds_since_reset 463fd77f6baSMauro Carvalho Chehab │ │ ├── size_mb 464fd77f6baSMauro Carvalho Chehab │ │ ├── ue_count 465fd77f6baSMauro Carvalho Chehab │ │ ├── ue_noinfo_count 466fd77f6baSMauro Carvalho Chehab │ │ └── uevent 467fd77f6baSMauro Carvalho Chehab │ ├── mc1 468fd77f6baSMauro Carvalho Chehab │ │ ├── ce_count 469fd77f6baSMauro Carvalho Chehab │ │ ├── ce_noinfo_count 470fd77f6baSMauro Carvalho Chehab │ │ ├── dimm0 4714fb6fde7SAaron Miller │ │ │ ├── dimm_ce_count 472fd77f6baSMauro Carvalho Chehab │ │ │ ├── dimm_dev_type 473fd77f6baSMauro Carvalho Chehab │ │ │ ├── dimm_edac_mode 474fd77f6baSMauro Carvalho Chehab │ │ │ ├── dimm_label 475fd77f6baSMauro Carvalho Chehab │ │ │ ├── dimm_location 476fd77f6baSMauro Carvalho Chehab │ │ │ ├── dimm_mem_type 
        │   │   │   ├── dimm_ue_count
        │   │   │   ├── size
        │   │   │   └── uevent
        │   │   ├── max_location
        │   │   ├── mc_name
        │   │   ├── reset_counters
        │   │   ├── seconds_since_reset
        │   │   ├── size_mb
        │   │   ├── ue_count
        │   │   ├── ue_noinfo_count
        │   │   └── uevent
        │   └── uevent
        └── uevent

In the ``dimmX`` directories are EDAC control and attribute files for
this ``X`` memory module:

- ``size`` - Total memory managed by this memory module attribute file

        This attribute file displays, in count of megabytes, the memory
        that this memory module contains.

- ``dimm_ue_count`` - Uncorrectable Errors count attribute file

        This attribute file displays the total count of uncorrectable
        errors that have occurred on this DIMM. If panic_on_ue is set,
        this counter will not have a chance to increment, since EDAC
        will panic the system.

- ``dimm_ce_count`` - Correctable Errors count attribute file

        This attribute file displays the total count of correctable
        errors that have occurred on this DIMM. This count is very
        important to examine.
        CEs provide early indications that a
        DIMM is beginning to fail. This count field should be
        monitored for non-zero values, and any such value should be
        reported to the system administrator.

- ``dimm_dev_type`` - Device type attribute file

        This attribute file will display what type of DRAM device is
        being utilized on this DIMM.
        Examples:

                - x1
                - x2
                - x4
                - x8

- ``dimm_edac_mode`` - EDAC Mode of operation attribute file

        This attribute file will display what type of Error detection
        and correction is being utilized.

- ``dimm_label`` - memory module label control file

        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label.
        This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.

- ``dimm_location`` - location of the memory module

        The location can have up to 3 levels, and describes how the
        memory controller identifies the location of a memory module.
        Depending on the type of memory and memory controller, it
        can be:

                - *csrow* and *channel* - used when the memory controller
                  doesn't identify a single DIMM - e.g. in ``rankX`` dir;
                - *branch*, *channel*, *slot* - typically used on FB-DIMM memory
                  controllers;
                - *channel*, *slot* - used on Nehalem and newer Intel drivers.

- ``dimm_mem_type`` - Memory Type attribute file

        This attribute file will display what type of memory is currently
        on this memory module. Normally, either buffered or unbuffered memory.
        Examples:

                - Registered-DDR
                - Unbuffered-DDR

.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is called
  ``rankX`` and works in a similar way to the ``csrowX`` directories.
  On modern Intel memory controllers, the memory controller identifies the
  memory modules directly. On such systems, the directory is called ``dimmX``.

.. [#f6] There are also some ``power`` directories and ``subsystem``
  symlinks inside the sysfs mapping that are automatically created by
  the sysfs subsystem. Currently, they serve no purpose.

``csrowX`` directories
----------------------

When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
directories. As this API doesn't work properly for Rambus, FB-DIMMs and
modern Intel Memory Controllers, this is being deprecated in favor of
``dimmX`` directories.

In the ``csrowX`` directories are EDAC control and attribute files for
this ``X`` instance of csrow:


- ``ue_count`` - Total Uncorrectable Errors count attribute file

        This attribute file displays the total count of uncorrectable
        errors that have occurred on this csrow. If panic_on_ue is set
        this counter will not have a chance to increment, since EDAC
        will panic the system.


- ``ce_count`` - Total Correctable Errors count attribute file

        This attribute file displays the total count of correctable
        errors that have occurred on this csrow. This count is very
        important to examine. CEs provide early indications that a
        DIMM is beginning to fail. This count field should be
        monitored for non-zero values, and any such value should be
        reported to the system administrator.


- ``size_mb`` - Total memory managed by this csrow attribute file

        This attribute file displays, in count of megabytes, the memory
        that this csrow contains.


- ``mem_type`` - Memory Type attribute file

        This attribute file will display what type of memory is currently
        on this csrow. Normally, either buffered or unbuffered memory.
        Examples:

                - Registered-DDR
                - Unbuffered-DDR


- ``edac_mode`` - EDAC Mode of operation attribute file

        This attribute file will display what type of Error detection
        and correction is being utilized.


- ``dev_type`` - Device type attribute file

        This attribute file will display what type of DRAM device is
        being utilized on this DIMM.
        Examples:

                - x1
                - x2
                - x4
                - x8


- ``ch0_ce_count`` - Channel 0 CE Count attribute file

        This attribute file will display the count of CEs on this
        DIMM located in channel 0.


- ``ch0_ue_count`` - Channel 0 UE Count attribute file

        This attribute file will display the count of UEs on this
        DIMM located in channel 0.


- ``ch0_dimm_label`` - Channel 0 DIMM Label control file


        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.


- ``ch1_ce_count`` - Channel 1 CE Count attribute file


        This attribute file will display the count of CEs on this
        DIMM located in channel 1.


- ``ch1_ue_count`` - Channel 1 UE Count attribute file


        This attribute file will display the count of UEs on this
        DIMM located in channel 1.


- ``ch1_dimm_label`` - Channel 1 DIMM Label control file

        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.


System Logging
--------------

If logging for UEs and CEs is enabled, then system logs will contain
information indicating that errors have been detected::

  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac


The structure of the message, using the first line above as the example, is:

        +---------------------------------------+-------------+
        | Content                               | Example     |
        +=======================================+=============+
        | The memory controller                 | MC0         |
        +---------------------------------------+-------------+
        | Error type                            | CE          |
        +---------------------------------------+-------------+
        | Memory page                           | 0x283       |
        +---------------------------------------+-------------+
        | Offset in the page                    | 0xce0       |
        +---------------------------------------+-------------+
        | The byte granularity                  | grain 8     |
        | or resolution of the error            |             |
        +---------------------------------------+-------------+
        | The error syndrome                    | 0x6ec3      |
        +---------------------------------------+-------------+
        | Memory row                            | row 0       |
        +---------------------------------------+-------------+
        | Memory channel                        | channel 1   |
        +---------------------------------------+-------------+
        | DIMM label, if set prior              | DIMM_B1     |
        +---------------------------------------+-------------+
        | And then an optional, driver-specific |             |
        | message that may have additional      |             |
        | information.                          |             |
        +---------------------------------------+-------------+

Both UEs and CEs with no info will lack all but memory controller, error
type, a notice of "no info" and then an optional, driver-specific error
message.
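
The message structure above can be split into fields with a regular
expression. The following is a minimal sketch that only covers the message
layout of the example lines shown in this section; other kernels or drivers
may log a different layout. The names ``EDAC_MC_RE`` and
``parse_edac_mc_line`` are illustrative and not part of any EDAC tool.

.. code-block:: python

    import re

    # Matches the layout of the example messages shown in this section.
    EDAC_MC_RE = re.compile(
        r"EDAC (?P<mc>MC\d+): (?P<type>CE|UE) "
        r"page (?P<page>0x[0-9a-fA-F]+), offset (?P<offset>0x[0-9a-fA-F]+), "
        r"grain (?P<grain>\d+), syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
        r"row (?P<row>\d+), channel (?P<channel>\d+)"
        r'(?: "(?P<label>[^"]*)")?'       # DIMM label, only if one was set
        r"(?:: (?P<driver_msg>.*))?$"     # optional driver-specific message
    )


    def parse_edac_mc_line(line):
        """Return a dict of fields, or None if the line has another layout."""
        match = EDAC_MC_RE.search(line.strip())
        if match is None:
            return None
        fields = match.groupdict()
        for key in ("page", "offset", "syndrome"):
            fields[key] = int(fields[key], 16)
        for key in ("grain", "row", "channel"):
            fields[key] = int(fields[key])
        return fields

Messages without location details, such as the "no info" case, do not match
this pattern, so the function returns ``None`` for them rather than guessing
at missing fields.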


PCI Bus Parity Detection
------------------------

On Header Type 00 devices, the primary status is looked at for any
parity error regardless of whether parity is enabled on the device or
not. (The spec indicates parity is generated in some cases). On Header
Type 01 bridges, the secondary status register is also looked at to see
if a parity error occurred on the bus on the other side of the bridge.


Sysfs configuration
-------------------

Under ``/sys/devices/system/edac/pci`` are control and attribute files as
follows:


- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file

        This control file enables or disables the PCI Bus Parity scanning
        operation. Writing a 1 to this file enables the scanning. Writing
        a 0 to this file disables the scanning.

        Enable::

                echo "1" >/sys/devices/system/edac/pci/check_pci_parity

        Disable::

                echo "0" >/sys/devices/system/edac/pci/check_pci_parity


- ``pci_parity_count`` - Parity Count

        This attribute file will display the number of parity errors that
        have been detected.


Module parameters
-----------------

- ``edac_mc_panic_on_ue`` - Panic on UE control file

        An uncorrectable error will cause a machine panic. This is usually
        desirable. It is a bad idea to continue when an uncorrectable error
        occurs - it is indeterminate what was uncorrected and the operating
        system context might be so mangled that continuing will lead to further
        corruption. If the kernel has MCE configured, then EDAC will never
        notice the UE.

        LOAD TIME::

                module/kernel parameter: edac_mc_panic_on_ue=[0|1]

        RUN TIME::

                echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue


- ``edac_mc_log_ue`` - Log UE control file


        Generate kernel messages describing uncorrectable errors. These errors
        are reported through the system message log. UE statistics
        will be accumulated even when UE logging is disabled.

        LOAD TIME::

                module/kernel parameter: edac_mc_log_ue=[0|1]

        RUN TIME::

                echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue


- ``edac_mc_log_ce`` - Log CE control file


        Generate kernel messages describing correctable errors. These
        errors are reported through the system message log.
        CE statistics will be accumulated even when CE logging is disabled.

        LOAD TIME::

                module/kernel parameter: edac_mc_log_ce=[0|1]

        RUN TIME::

                echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce


- ``edac_mc_poll_msec`` - Polling period control file


        The time period, in milliseconds, for polling for error information.
        Too small a value wastes resources. Too large a value might delay
        necessary handling of errors and might lose valuable information for
        locating the error. 1000 milliseconds (once each second) is the current
        default. Systems that require all the bandwidth they can get may
        increase this value.

        LOAD TIME::

                module/kernel parameter: edac_mc_poll_msec=<milliseconds>

        RUN TIME::

                echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec


- ``panic_on_pci_parity`` - Panic on PCI PARITY Error


        This control file enables or disables panicking when a parity
        error has been detected.


        module/kernel parameter::

                        edac_panic_on_pci_pe=[0|1]

        Enable::

                echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe

        Disable::

                echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe



EDAC device type
----------------

In the header file, edac_pci.h, there is a series of edac_device structures
and APIs for the EDAC_DEVICE.

User space access to an edac_device is through the sysfs interface.

At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
will appear.

There is a three-level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
website) installs itself as::

        /sys/devices/system/edac/test-instance

In this directory are various controls, a symlink and one or more ``instance``
directories.

The standard default controls are:

        ==============  =======================================================
        log_ce          boolean to log CE events
        log_ue          boolean to log UE events
        panic_on_ue     boolean to ``panic`` the system if a UE is encountered
                        (default off, can be set true via startup script)
        poll_msec       time period between POLL cycles for events
        ==============  =======================================================

The test_device_edac device adds at least one custom control of its own:

        ==============  ==================================================
        test_bits       which in the current test driver does nothing but
                        show how it is installed. A ported driver can
                        add one or more such controls and/or attributes
                        for specific uses.
                        One out-of-tree driver uses controls here to allow
                        for ERROR INJECTION operations to hardware
                        injection registers
        ==============  ==================================================

The symlink points to the 'struct dev' that is registered for this edac_device.

Instances
---------

One or more instance directories are present. For the ``test_device_edac``
case:

        +----------------+
        | test-instance0 |
        +----------------+


In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.

        ==============  ====================================
        ce_count        total of CE events of subdirectories
        ue_count        total of UE events of subdirectories
        ==============  ====================================

Blocks
------

At the lowest directory level is the ``block`` directory.
There can be 0, 1
or more blocks specified in each instance:

        +-------------+
        | test-block0 |
        +-------------+

In this directory the default attributes are:

        ==============  ================================================
        ce_count        counter of CE events for this ``block``
                        of hardware being monitored
        ue_count        counter of UE events for this ``block``
                        of hardware being monitored
        ==============  ================================================


The ``test_device_edac`` device adds 4 attributes and 1 control:

        ================== ====================================================
        test-block-bits-0  for every POLL cycle this counter
                           is incremented
        test-block-bits-1  every 10 cycles, this counter is bumped once,
                           and test-block-bits-0 is set to 0
        test-block-bits-2  every 100 cycles, this counter is bumped once,
                           and test-block-bits-1 is set to 0
        test-block-bits-3  every 1000 cycles, this counter is bumped once,
                           and test-block-bits-2 is set to 0
        ================== ====================================================


        ================== ====================================================
        reset-counters     writing anything to this control will
                           reset all the above counters.
        ================== ====================================================


Use of the ``test_device_edac`` driver should enable others to create their own
unique drivers for their hardware systems.

The ``test_device_edac`` sample driver is located at the
http://bluesmoke.sourceforge.net project site for EDAC.


Usage of EDAC APIs on Nehalem and newer Intel CPUs
--------------------------------------------------

On older Intel architectures, the memory controller was part of the North
Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Skylake and
newer Intel architectures integrated an enhanced version of the memory
controller (MC) inside the CPUs.

This chapter covers the differences of the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``sbx_edac``.

.. note::

   The Xeon E7 processor families use a separate chip for the memory
   controller, called Intel Scalable Memory Buffer. This section doesn't
   apply to such families.

1) There is one Memory Controller per QuickPath Interconnect
   (QPI). At the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about csrows.
   As the EDAC API treats the csrow as the smallest unit, the driver
   sequentially maps channel/DIMM into different csrows.

   For example, supposing the following layout::

        Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
          dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
        Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
        Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400

   The driver will map it as::

        csrow0: channel 0, dimm0
        csrow1: channel 0, dimm1
        csrow2: channel 1, dimm0
        csrow3: channel 2, dimm0

   That is, the driver exports one DIMM per csrow.

   Each QPI is exported as a different memory controller.

2) The MC has the ability to inject errors to test drivers.
   The drivers implement this functionality via some error injection
   sysfs nodes, under ``/sys/devices/system/edac/mc/mc?/``:

   - ``inject_addrmatch/*``:
      Controls the error injection mask register. It is possible to specify
      several characteristics of the address to match an error code::

         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.

      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to "any".

      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column::

                echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
                echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

      To return to the default behaviour of matching any, you can do::

                echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
                echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

   - ``inject_eccmask``:
      specifies which bits will have errors injected.

   - ``inject_section``:
      specifies which ECC cache section will get the error::

                3 for both
                2 for the highest
                1 for the lowest

   - ``inject_type``:
      specifies the type of error, as a combination of the following bits::

                bit 0 - repeat
                bit 1 - ecc
                bit 2 - parity

   - ``inject_enable``:
      starts the error generation when a value other than 0 is written.
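
   Since ``inject_type`` is a combination of bits, the value to write can be
   built with shell arithmetic. The following is only a sketch: the
   ``inject_type_value`` helper is hypothetical and not part of any EDAC
   tool; it just maps the flag names listed above to their bit positions::

       # Hypothetical helper: compute an inject_type value from flag names.
       inject_type_value() {
           value=0
           for flag in "$@"; do
               case "$flag" in
                   repeat) value=$((value | 1)) ;;   # bit 0
                   ecc)    value=$((value | 2)) ;;   # bit 1
                   parity) value=$((value | 4)) ;;   # bit 2
                   *) echo "unknown flag: $flag" >&2; return 1 ;;
               esac
           done
           echo "$value"
       }

       inject_type_value ecc          # prints 2
       inject_type_value repeat ecc   # prints 3

   The printed number is what would be written to ``inject_type``.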

   All inject vars can be read. Root permission is needed to write them.

   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2::

        echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
        echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
        echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
        echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
        echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
        dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >/dev/null 2>&1

   For socket 1, replace "mc0" with "mc1" in the above commands.
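
   The same sequence can be wrapped so that the socket becomes a parameter.
   This is only a sketch: the ``inject_channel_error`` function and its
   optional base-directory argument are hypothetical conveniences, and the
   values written are simply the ones used in the commands above. Only the
   sysfs node names come from the driver::

       # Hypothetical wrapper around the commands above.
       # $1 = memory controller (socket) number, $2 = channel,
       # $3 = sysfs base directory (defaults to the real EDAC location).
       inject_channel_error() {
           mc_dir="${3:-/sys/devices/system/edac/mc}/mc$1"
           echo "$2" >"$mc_dir/inject_addrmatch/channel" &&
           echo 2 >"$mc_dir/inject_type" &&       # bit 1: ecc
           echo 64 >"$mc_dir/inject_eccmask" &&
           echo 3 >"$mc_dir/inject_section" &&    # both sections
           echo 1 >"$mc_dir/inject_enable"        # arm the injection
       }

       # Usage (as root): arm injection on channel 2 of socket 1.
       # inject_channel_error 1 2

   Keeping the base directory as an argument lets the function be tried
   against a scratch directory before it is pointed at the real sysfs tree.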

   The generated error message will look like::

        EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

3) Corrected Error memory register counters

   Those newer MCs have some registers to count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMMs. As the chipset
   offers some counters that also work with UDIMMs (but with a worse level of
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.

   They can be read by looking at the contents of ``all_channel_counts/``::

     $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo "$i"; cat "$i"; done
        /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
        0
        /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
        0
        /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
        0

   What happens here is that errors on different csrows, but at the same
   dimm number, will increment the same counter.
   So, in this memory mapping::

        csrow0: channel 0, dimm0
        csrow1: channel 0, dimm1
        csrow2: channel 1, dimm0
        csrow3: channel 2, dimm0

   The hardware will increment udimm0 for an error at the first dimm of any
   channel, that is, at either csrow0, csrow2 or csrow3;

   The hardware will increment udimm1 for an error at the second dimm of any
   channel, that is, at csrow1;

   The hardware will increment udimm2 for an error at the third dimm of any
   channel; no csrow in this mapping holds a third dimm.

4) Standard error counters

   The standard error counters are generated when an mcelog error is received
   by the driver. Since, with UDIMMs, this is counted by software, it is
   possible that some errors could be lost. With RDIMMs, they display the
   contents of the registers.

Reference documents used on ``amd64_edac``
------------------------------------------

The ``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. :Title:  BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
            Opteron Processors
   :AMD publication #: 26094
   :Revision: 3.26
   :Link: http://support.amd.com/TechDocs/26094.PDF

2. :Title:  BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
            Processors
   :AMD publication #: 32559
   :Revision: 3.00
   :Issue Date: May 2006
   :Link: http://support.amd.com/TechDocs/32559.pdf

3. :Title:  BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
            Processors
   :AMD publication #: 31116
   :Revision: 3.00
   :Issue Date: September 07, 2007
   :Link: http://support.amd.com/TechDocs/31116.pdf

4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
           Models 30h-3Fh Processors
   :AMD publication #: 49125
   :Revision: 3.06
   :Issue Date: 2/12/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
           Models 60h-6Fh Processors
   :AMD publication #: 50742
   :Revision: 3.01
   :Issue Date: 7/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
           Models 00h-0Fh Processors
   :AMD publication #: 48751
   :Revision: 3.03
   :Issue Date: 2/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

Credits
=======

* Written by Doug Thompson <dougthompson@xmission.com>

  - 7 Dec 2005
  - 17 Jul 2007 Updated

* |copy| Mauro Carvalho Chehab

  - 05 Aug 2009 Nehalem interface
  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section

* EDAC authors/maintainers:

  - Doug Thompson, Dave Jiang, Dave Peterson et al,
  - Mauro Carvalho Chehab
  - Borislav Petkov
  - original author: Thayne Harbaugh