.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================

RAS concepts
************

Reliability, Availability and Serviceability (RAS) is a concept used on
servers meant to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time

  * Generally measured as a percentage of downtime over a period of time
  * Often uses mechanisms to detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained

  * Generally measured as Mean Time Between Repair (MTBR)

Improving RAS
-------------

In order to reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime. It should
also provide mechanisms to detect hardware degradation, in order to warn
the system administrator to replace a component before it causes data loss
or system downtime.

Among the monitoring measures, the most usual ones include:

* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
* Memory – add error correction logic (ECC) to detect and correct errors;
* I/O – add CRC checksums for transferred data;
* Storage – RAID, journaling file systems, checksums,
  Self-Monitoring, Analysis and Reporting Technology (SMART).

By monitoring the number of detected errors, it is possible to identify
whether the probability of hardware errors is increasing and, in such a
case, do preventive maintenance to replace a degraded component while
those errors are still correctable.

Types of errors
---------------

Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors on a bit packet
is below a threshold. If the number of errors is above it, those mechanisms
can indicate with a high degree of confidence that an error happened, but
they cannot correct it.

Also, sometimes an error occurs on a component that is not in use. For
example, a part of the memory that is not currently allocated.

That defines some categories of errors:

* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors exceeded the error
  correction threshold, and the system was unable to auto-correct.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, eventually replacing the affected hardware by a hot spare,
  if available.

  Also, when an error happens on a userspace process, it is possible to
  kill such process and let userspace restart it.
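
The difference between a CE and a UE can be illustrated with a toy code.
The sketch below is not what any real memory controller implements: it
assumes an extended Hamming(8,4) code (single error correction, double
error detection) over 4 data bits, and the function names ``encode`` and
``decode`` are hypothetical. One flipped bit is corrected and reported as
a CE, while two flipped bits are detected but reported as a UE:

.. code-block:: python

   def encode(nibble):
       """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
       code = [0] * 8                    # code[0] is the overall parity bit
       # Data bits live at positions 3, 5, 6 and 7 (1-based Hamming layout).
       code[3] = (nibble >> 0) & 1
       code[5] = (nibble >> 1) & 1
       code[6] = (nibble >> 2) & 1
       code[7] = (nibble >> 3) & 1
       # Check bits at positions 1, 2 and 4 cover the positions whose index
       # has the corresponding bit set.
       code[1] = code[3] ^ code[5] ^ code[7]
       code[2] = code[3] ^ code[6] ^ code[7]
       code[4] = code[5] ^ code[6] ^ code[7]
       code[0] = sum(code[1:]) % 2       # make the total parity even
       return code


   def decode(code):
       """Return (status, nibble), where status is "OK", "CE" or "UE"."""
       code = list(code)
       syndrome = 0
       for position in range(1, 8):
           if code[position]:
               syndrome ^= position      # XOR of the positions of set bits
       parity_odd = sum(code) % 2 == 1
       if parity_odd:
           # Odd overall parity: assume a single flipped bit. The syndrome is
           # its position (0 means the overall parity bit itself).
           code[syndrome] ^= 1
           status = "CE"
       elif syndrome:
           # Even parity with a non-zero syndrome: two bits were flipped.
           return "UE", None
       else:
           status = "OK"
       nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
       return status, nibble


   word = encode(0b1011)
   single = list(word)
   single[5] ^= 1                        # one bit flipped
   double = list(single)
   double[6] ^= 1                        # a second bit flipped
   print(decode(word), decode(single), decode(double))

With three or more flipped bits such a code can silently miscorrect, which
is why the text above speaks of a threshold rather than a guarantee.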

The mechanism for handling non-fatal errors is usually complex and may
require the help of some userspace application, in order to apply the
policy desired by the system administrator.

Identifying a bad hardware component
------------------------------------

Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
will translate the error message to the silkscreen or component label for
the MRU.

Typically, this is very complex for memory, as modern CPUs interleave memory
from different memory modules in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::

    Memory Device
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: SODIMM
        Set: None
        Locator: ChannelA-DIMM0
        Bank Locator: BANK 0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2133 MHz
        Rank: 2
        Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is located at the
slot labeled "BANK 0", as given by the *bank locator* field.
Note that, on this system, the *total width* is equal to the
*data width*. This means that the memory module has no error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank.
In this example, from an older server, ``dmidecode`` shows::

    Memory Device
        Array Handle: 0x1000
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: 1
        Locator: DIMM_A1
        Bank Locator: Not Specified
        Type: DDR3
        Type Detail: Synchronous Registered (Buffered)
        Speed: 1600 MHz
        Rank: 2
        Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is located at the slot labeled
"DIMM_A1", as given by the *locator* field. Note that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
This kind of memory is called Error-correcting code memory (ECC memory).

To make things even worse, it is not uncommon for systems with different
labels on their boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.

ECC memory
----------

As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. So, on 64 bit systems, a memory module
has 64 bits of *data width* and 72 bits of *total width*, leaving
8 extra bits to be used by the error detection and correction
mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
on the memory modules.

At read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened. If not, an
Uncorrected Error (UE) happened.

The information about the CE/UE errors is stored in special registers
at the memory controller and can be accessed by reading such registers,
either by the BIOS, by some special CPUs or by the Linux EDAC driver. On
x86 64 bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.

.. [#f1] Please notice that several memory controllers allow operation in a
  mode called "Lock-Step", where two memory modules are grouped together,
  doing 128-bit reads/writes. That gives 16 bits for error correction, which
  significantly improves the error correction mechanism, at the expense
  that, when an error happens, there's no way to know which memory module is
  to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
  In such mode, the same data is written to two memory modules. At read,
  the system checks both memory modules, in order to check if both provide
  identical data. In such configuration, when an error happens, there's no
  way to know which memory module is to blame. So, it has to blame both
  memory modules (or 4 memory modules, if the system is also in Lock-step
  mode).

.. [#f3] For more details about the Machine Check Architecture (MCA),
  please read Documentation/x86/x86_64/machinecheck at the Kernel tree.

EDAC - Error Detection And Correction
*************************************

.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was pushed upstream for the first time, on
   Kernel 2.6.16, it was renamed to ``EDAC``.

Purpose
-------

The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

Memory
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the ``edac_mc`` device.

Detecting CE events, then harvesting those events and reporting them,
**can**, but need not necessarily, be a predictor of future UE events. With
CE events only, the system can and will continue to operate as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
modules exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

Other hardware elements
-----------------------

A new feature for EDAC, the ``edac_device`` class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory type of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.


PCI bus scanning
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float", giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device. The attribute is::

    broken_parity_status

and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
PCI devices.


Versioning
----------

EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
``edac_device`` driver) have individual versions that reflect the current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


Loading
-------

If ``edac`` was statically linked with the kernel, then no loading
is necessary. If ``edac`` was built as modules, then simply modprobe
the ``edac`` pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example::

    $ modprobe amd76x_edac

loads both the ``amd76x_edac.ko`` memory controller module and the
``edac_mc.ko`` core module.


Sysfs interface
---------------

EDAC presents a ``sysfs`` interface for control and reporting purposes. It
lives in the ``/sys/devices/system/edac`` directory.

Within this directory there currently reside 2 components:

    ======= ==============================
    mc      memory controller(s) system
    pci     PCI control and status system
    ======= ==============================



Memory Controller (mc) Model
----------------------------

Each ``mc`` device controls a set of memory modules [#f4]_. These modules
are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
There can be multiple csrows and multiple channels.

.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
  used to refer to a memory module, although there are other memory
  packaging alternatives, like SO-DIMM, SIMM, etc. Throughout this document,
  and inside the EDAC system, the term "dimm" is used for all memory
  modules, even when they use a different kind of packaging.

Memory controllers allow for several csrows, with 8 csrows being a
typical value. Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and memory module characteristics.

Dual channels allow for dual data length (e.g. 128 bits, on 64 bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers.
The following example will assume 2 channels:

    +------------+-----------------------+
    | Chip       |       Channels        |
    | Select     +-----------+-----------+
    | rows       |  ``ch0``  |  ``ch1``  |
    +============+===========+===========+
    | ``csrow0`` |  DIMM_A0  |  DIMM_B0  |
    +------------+           |           |
    | ``csrow1`` |           |           |
    +------------+-----------+-----------+
    | ``csrow2`` |  DIMM_A1  |  DIMM_B1  |
    +------------+           |           |
    | ``csrow3`` |           |           |
    +------------+-----------+-----------+

In the above example, there are 4 physical slots on the motherboard
for memory DIMMs:

    +---------+---------+
    | DIMM_A0 | DIMM_B0 |
    +---------+---------+
    | DIMM_A1 | DIMM_B1 |
    +---------+---------+

Labels for these slots are usually silk-screened on the motherboard.
Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
will have just one csrow (csrow0). csrow1 will be empty. On the other
hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0
and csrow1 will be populated. The pattern repeats itself for csrow2 and
csrow3.

The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
``/sys/devices/system/edac/mc``, each memory controller will be
represented by its own ``mcX`` directory, where ``X`` is the
index of the MC::

    ..../edac/mc/
               |
               |->mc0
               |->mc1
               |->mc2
               ....
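
Enumerating those ``mcX`` directories from userspace can be sketched as
follows. This is a minimal example and not part of EDAC: the helper name
``list_memory_controllers`` is hypothetical, and the only assumption about
sysfs is the ``mcX`` naming described above.

.. code-block:: python

   import os
   import re


   def list_memory_controllers(mc_root="/sys/devices/system/edac/mc"):
       """Return the mcX directory names under mc_root, sorted by index X."""
       pattern = re.compile(r"^mc(\d+)$")
       found = []
       for name in os.listdir(mc_root):
           match = pattern.match(name)
           # Skip plain files such as "uevent" and anything not named mcX.
           if match and os.path.isdir(os.path.join(mc_root, name)):
               found.append((int(match.group(1)), name))
       # Sort numerically so that mc10 comes after mc2.
       return [name for _, name in sorted(found)]

Taking the root directory as a parameter keeps the helper testable against
a scratch directory on machines without an EDAC driver loaded.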

Under each ``mcX`` directory each ``csrowX`` is again represented by a
``csrowX`` directory, where ``X`` is the csrow index::

    .../mc/mc0/
            |
            |->csrow0
            |->csrow2
            |->csrow3
            ....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.

Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.

``mcX`` directories
-------------------

In ``mcX`` directories are EDAC control and attribute files for
this ``X`` instance of the memory controllers.

For a description of the sysfs API, please see:

    Documentation/ABI/testing/sysfs-devices-edac


``dimmX`` or ``rankX`` directories
----------------------------------

The recommended way to use the EDAC subsystem is to look at the information
provided by the ``dimmX`` or ``rankX`` directories [#f5]_.
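
A minimal sketch of reading those directories from userspace is shown here.
It assumes that the per-module attribute files ``dimm_label``,
``dimm_ce_count`` and ``dimm_ue_count`` exist under each ``dimmX``
directory of the running kernel. The helper names ``read_attr`` and
``dimm_error_counts`` are hypothetical, and ``rankX`` directories are not
handled.

.. code-block:: python

   import glob
   import os


   def read_attr(path):
       """Return the content of one sysfs attribute file, stripped."""
       with open(path) as attr:
           return attr.read().strip()


   def dimm_error_counts(edac_root="/sys/devices/system/edac"):
       """Return one dict per dimmX directory with its label and CE/UE counts."""
       rows = []
       pattern = os.path.join(edac_root, "mc", "mc*", "dimm*")
       for dimm_dir in sorted(glob.glob(pattern)):
           if not os.path.isdir(dimm_dir):
               continue
           rows.append({
               "mc": os.path.basename(os.path.dirname(dimm_dir)),
               "dimm": os.path.basename(dimm_dir),
               "label": read_attr(os.path.join(dimm_dir, "dimm_label")),
               "ce": int(read_attr(os.path.join(dimm_dir, "dimm_ce_count"))),
               "ue": int(read_attr(os.path.join(dimm_dir, "dimm_ue_count"))),
           })
       return rows

A monitoring script could call ``dimm_error_counts()`` periodically and warn
the administrator about any module whose ``ce`` value keeps growing.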

A typical EDAC system has the following structure under
``/sys/devices/system/edac/``\ [#f6]_::

    /sys/devices/system/edac/
    ├── mc
    │   ├── mc0
    │   │   ├── ce_count
    │   │   ├── ce_noinfo_count
    │   │   ├── dimm0
    │   │   │   ├── dimm_ce_count
    │   │   │   ├── dimm_dev_type
    │   │   │   ├── dimm_edac_mode
    │   │   │   ├── dimm_label
    │   │   │   ├── dimm_location
    │   │   │   ├── dimm_mem_type
    │   │   │   ├── dimm_ue_count
    │   │   │   ├── size
    │   │   │   └── uevent
    │   │   ├── max_location
    │   │   ├── mc_name
    │   │   ├── reset_counters
    │   │   ├── seconds_since_reset
    │   │   ├── size_mb
    │   │   ├── ue_count
    │   │   ├── ue_noinfo_count
    │   │   └── uevent
    │   ├── mc1
    │   │   ├── ce_count
    │   │   ├── ce_noinfo_count
    │   │   ├── dimm0
    │   │   │   ├── dimm_ce_count
    │   │   │   ├── dimm_dev_type
    │   │   │   ├── dimm_edac_mode
    │   │   │   ├── dimm_label
    │   │   │   ├── dimm_location
    │   │   │   ├── dimm_mem_type
    │   │   │   ├── dimm_ue_count
    │   │   │   ├── size
    │   │   │   └── uevent
    │   │   ├── max_location
    │   │   ├── mc_name
    │   │   ├── reset_counters
    │   │   ├── seconds_since_reset
    │   │   ├── size_mb
    │   │   ├── ue_count
    │   │   ├── ue_noinfo_count
    │   │   └── uevent
    │   └── uevent
    └── uevent

In the ``dimmX`` directories are EDAC control and attribute files for
this ``X`` memory module:

- ``size`` - Total memory managed by this csrow attribute file

  This attribute file displays, in count of megabytes, the memory
  that this csrow contains.

- ``dimm_ue_count`` - Uncorrectable Errors count attribute file

  This attribute file displays the total count of uncorrectable
  errors that have occurred on this DIMM. If panic_on_ue is set
  this counter will not have a chance to increment, since EDAC
  will panic the system.

- ``dimm_ce_count`` - Correctable Errors count attribute file

  This attribute file displays the total count of correctable
  errors that have occurred on this DIMM. This count is very
  important to examine.
  CEs provide early indications that a
  DIMM is beginning to fail. This count field should be
  monitored for non-zero values, and any such values reported
  to the system administrator.

- ``dimm_dev_type`` - Device type attribute file

  This attribute file will display what type of DRAM device is
  being utilized on this DIMM.
  Examples:

  - x1
  - x2
  - x4
  - x8

- ``dimm_edac_mode`` - EDAC Mode of operation attribute file

  This attribute file will display what type of Error detection
  and correction is being utilized.

- ``dimm_label`` - memory module label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM Labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.

- ``dimm_location`` - location of the memory module

  The location can have up to 3 levels, and describes how the
  memory controller identifies the location of a memory module.
  Depending on the type of memory and memory controller, it
  can be:

  - *csrow* and *channel* - used when the memory controller
    doesn't identify a single DIMM - e.g. in ``rankX`` dir;
  - *branch*, *channel*, *slot* - typically used on FB-DIMM memory
    controllers;
  - *channel*, *slot* - used on Nehalem and newer Intel drivers.

- ``dimm_mem_type`` - Memory Type attribute file

  This attribute file will display what type of memory is currently
  on this csrow. Normally, either buffered or unbuffered memory.
  Examples:

  - Registered-DDR
  - Unbuffered-DDR

.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is
  called ``rankX`` and works in a similar way as the ``csrowX`` directories.
  On modern Intel memory controllers, the memory controller identifies the
  memory modules directly. On such systems, the directory is called ``dimmX``.

.. [#f6] There are also some ``power`` directories and ``subsystem``
  symlinks inside the sysfs mapping that are automatically created by
  the sysfs subsystem. Currently, they serve no purpose.

``csrowX`` directories
----------------------

When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
directories. As this API doesn't work properly for Rambus, FB-DIMMs and
modern Intel Memory Controllers, this is being deprecated in favor of
``dimmX`` directories.

In the ``csrowX`` directories are EDAC control and attribute files for
this ``X`` instance of csrow:


- ``ue_count`` - Total Uncorrectable Errors count attribute file

  This attribute file displays the total count of uncorrectable
  errors that have occurred on this csrow. If panic_on_ue is set
  this counter will not have a chance to increment, since EDAC
  will panic the system.


- ``ce_count`` - Total Correctable Errors count attribute file

  This attribute file displays the total count of correctable
  errors that have occurred on this csrow. This count is very
  important to examine. CEs provide early indications that a
  DIMM is beginning to fail. This count field should be
  monitored for non-zero values, and any such values reported
  to the system administrator.


- ``size_mb`` - Total memory managed by this csrow attribute file

  This attribute file displays, in count of megabytes, the memory
  that this csrow contains.


- ``mem_type`` - Memory Type attribute file

  This attribute file will display what type of memory is currently
  on this csrow. Normally, either buffered or unbuffered memory.
  Examples:

  - Registered-DDR
  - Unbuffered-DDR


- ``edac_mode`` - EDAC Mode of operation attribute file

  This attribute file will display what type of Error detection
  and correction is being utilized.


- ``dev_type`` - Device type attribute file

  This attribute file will display what type of DRAM device is
  being utilized on this DIMM.
  Examples:

  - x1
  - x2
  - x4
  - x8


- ``ch0_ce_count`` - Channel 0 CE Count attribute file

  This attribute file will display the count of CEs on this
  DIMM located in channel 0.


- ``ch0_ue_count`` - Channel 0 UE Count attribute file

  This attribute file will display the count of UEs on this
  DIMM located in channel 0.


- ``ch0_dimm_label`` - Channel 0 DIMM Label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM Labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.


- ``ch1_ce_count`` - Channel 1 CE Count attribute file

  This attribute file will display the count of CEs on this
  DIMM located in channel 1.


- ``ch1_ue_count`` - Channel 1 UE Count attribute file

  This attribute file will display the count of UEs on this
  DIMM located in channel 1.


- ``ch1_dimm_label`` - Channel 1 DIMM Label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM Labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.


System Logging
--------------

If logging for UEs and CEs is enabled, then system logs will contain
information indicating that errors have been detected::

    EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
    EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac


The structure of the message, using the first line above as the example, is:

  +---------------------------------------+-------------+
  | Content                               | Example     |
  +=======================================+=============+
  | The memory controller                 | MC0         |
  +---------------------------------------+-------------+
  | Error type                            | CE          |
  +---------------------------------------+-------------+
  | Memory page                           | 0x283       |
  +---------------------------------------+-------------+
  | Offset in the page                    | 0xce0       |
  +---------------------------------------+-------------+
  | The byte granularity                  | grain 8     |
  | or resolution of the error            |             |
  +---------------------------------------+-------------+
  | The error syndrome                    | 0x6ec3      |
  +---------------------------------------+-------------+
  | Memory row                            | row 0       |
  +---------------------------------------+-------------+
  | Memory channel                        | channel 1   |
  +---------------------------------------+-------------+
  | DIMM label, if set prior              | DIMM_B1     |
  +---------------------------------------+-------------+
  | And then an optional, driver-specific |             |
  | message that may have additional      |             |
  | information.                          |             |
  +---------------------------------------+-------------+

Both UEs and CEs with no info will lack all but memory controller, error
type, a notice of "no info" and then an optional, driver-specific error
message.


PCI Bus Parity Detection
------------------------

On Header Type 00 devices, the primary status is looked at for any
parity error regardless of whether parity is enabled on the device or
not. (The spec indicates parity is generated in some cases). On Header
Type 01 bridges, the secondary status register is also looked at to see
if a parity error occurred on the bus on the other side of the bridge.


Sysfs configuration
-------------------

Under ``/sys/devices/system/edac/pci`` are control and attribute files as
follows:


- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file

  This control file enables or disables the PCI Bus Parity scanning
  operation. Writing a 1 to this file enables the scanning. Writing
  a 0 to this file disables the scanning.

  Enable::

      echo "1" >/sys/devices/system/edac/pci/check_pci_parity

  Disable::

      echo "0" >/sys/devices/system/edac/pci/check_pci_parity


- ``pci_parity_count`` - Parity Count

  This attribute file will display the number of parity errors that
  have been detected.


Module parameters
-----------------

- ``edac_mc_panic_on_ue`` - Panic on UE control file

  An uncorrectable error will cause a machine panic. This is usually
  desirable.
  It is a bad idea to continue when an uncorrectable error
  occurs - it is indeterminate what was uncorrected and the operating
  system context might be so mangled that continuing will lead to further
  corruption. If the kernel has MCE configured, then EDAC will never
  notice the UE.

  LOAD TIME::

      module/kernel parameter: edac_mc_panic_on_ue=[0|1]

  RUN TIME::

      echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue


- ``edac_mc_log_ue`` - Log UE control file

  Generate kernel messages describing uncorrectable errors. These errors
  are reported through the system message log system. UE statistics
  will be accumulated even when UE logging is disabled.

  LOAD TIME::

      module/kernel parameter: edac_mc_log_ue=[0|1]

  RUN TIME::

      echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue


- ``edac_mc_log_ce`` - Log CE control file

  Generate kernel messages describing correctable errors. These
  errors are reported through the system message log system.
  CE statistics will be accumulated even when CE logging is disabled.

  LOAD TIME::

      module/kernel parameter: edac_mc_log_ce=[0|1]

  RUN TIME::

      echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce


- ``edac_mc_poll_msec`` - Polling period control file

  The time period, in milliseconds, for polling for error information.
  Too small a value wastes resources. Too large a value might delay
  necessary handling of errors and might lose valuable information for
  locating the error. 1000 milliseconds (once each second) is the current
  default. Systems which require all the bandwidth they can get may
  increase this.
832 833 LOAD TIME:: 834 835 module/kernel parameter: edac_mc_poll_msec=[0|1] 836 837 RUN TIME:: 838 839 echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec 840 841 842- ``panic_on_pci_parity`` - Panic on PCI PARITY Error 843 844 845 This control file enables or disables panicking when a parity 846 error has been detected. 847 848 849 module/kernel parameter:: 850 851 edac_panic_on_pci_pe=[0|1] 852 853 Enable:: 854 855 echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe 856 857 Disable:: 858 859 echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe 860 861 862 863EDAC device type 864---------------- 865 866In the header file, edac_pci.h, there is a series of edac_device structures 867and APIs for the EDAC_DEVICE. 868 869User space access to an edac_device is through the sysfs interface. 870 871At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices 872will appear. 873 874There is a three level tree beneath the above ``edac`` directory. For example, 875the ``test_device_edac`` device (found at the http://bluesmoke.sourceforget.net 876website) installs itself as:: 877 878 /sys/devices/system/edac/test-instance 879 880in this directory are various controls, a symlink and one or more ``instance`` 881directories. 882 883The standard default controls are: 884 885 ============== ======================================================= 886 log_ce boolean to log CE events 887 log_ue boolean to log UE events 888 panic_on_ue boolean to ``panic`` the system if an UE is encountered 889 (default off, can be set true via startup script) 890 poll_msec time period between POLL cycles for events 891 ============== ======================================================= 892 893The test_device_edac device adds at least one of its own custom control: 894 895 ============== ================================================== 896 test_bits which in the current test driver does nothing but 897 show how it is installed. 
                    A ported driver can add one or more such controls
                    and/or attributes for specific uses.
                    One out-of-tree driver uses controls here to allow
                    for error injection operations on the hardware
                    injection registers.
    ==============  ==================================================

The symlink points to the 'struct dev' that is registered for this edac_device.

Instances
---------

One or more instance directories are present. For the ``test_device_edac``
case:

    +----------------+
    | test-instance0 |
    +----------------+


In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.

    ==============  ====================================
    ce_count        total of CE events of subdirectories
    ue_count        total of UE events of subdirectories
    ==============  ====================================

Blocks
------

At the lowest directory level is the ``block`` directory.
There can be 0, 1 or more blocks specified in each instance:

    +-------------+
    | test-block0 |
    +-------------+

In this directory the default attributes are:

    ==============  ================================================
    ce_count        counter of CE events for this ``block``
                    of hardware being monitored
    ue_count        counter of UE events for this ``block``
                    of hardware being monitored
    ==============  ================================================


The ``test_device_edac`` device adds 4 attributes and 1 control:

    ==================  ====================================================
    test-block-bits-0   for every POLL cycle this counter
                        is incremented
    test-block-bits-1   every 10 cycles, this counter is bumped once,
                        and test-block-bits-0 is set to 0
    test-block-bits-2   every 100 cycles, this counter is bumped once,
                        and test-block-bits-1 is set to 0
    test-block-bits-3   every 1000 cycles, this counter is bumped once,
                        and test-block-bits-2 is set to 0
    ==================  ====================================================


    ==================  ====================================================
    reset-counters      writing anything to this control will
                        reset all the above counters.
    ==================  ====================================================


The ``test_device_edac`` driver should serve as a template that lets others
create their own drivers for their hardware systems.

The ``test_device_edac`` sample driver is located at the
http://bluesmoke.sourceforge.net project site for EDAC.


Usage of EDAC APIs on Nehalem and newer Intel CPUs
--------------------------------------------------

On older Intel architectures, the memory controller was part of the North
Bridge chipset.
Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Skylake and
newer Intel architectures integrated an enhanced version of the memory
controller (MC) inside the CPUs.

This chapter covers the differences of the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``skx_edac``.

.. note::

   The Xeon E7 processor families use a separate chip for the memory
   controller, called Intel Scalable Memory Buffer. This section doesn't
   apply to such families.

1) There is one Memory Controller per Quick Path Interconnect
   (QPI). In the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API uses the csrow as its smallest unit, the driver
   sequentially maps each channel/DIMM pair into a different csrow.

   For example, supposing the following layout::

       Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
         dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
         dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
       Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
         dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
       Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
         dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400

   The driver will map it as::

       csrow0: channel 0, dimm0
       csrow1: channel 0, dimm1
       csrow2: channel 1, dimm0
       csrow3: channel 2, dimm0

   That is, it exports one DIMM per csrow.

   Each QPI is exported as a different memory controller.

2) The MC has the ability to inject errors to test drivers.
   The drivers implement this functionality via some error injection nodes.

   For injecting a memory error, there are some sysfs nodes, under
   ``/sys/devices/system/edac/mc/mc?/``:

   - ``inject_addrmatch/*``:
      Controls the error injection mask register. It is possible to specify
      several characteristics of the address to match an error code::

         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.

      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to any.

      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column::

         echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
         echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

      To return to the default behaviour of matching any, you can do::

         echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
         echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

   - ``inject_eccmask``:
      specifies which bits will be affected by the error.

   - ``inject_section``:
      specifies which ECC cache section will get the error::

         3 for both
         2 for the highest
         1 for the lowest

   - ``inject_type``:
      specifies the type of error, being a combination of the following bits::

         bit 0 - repeat
         bit 1 - ecc
         bit 2 - parity

   - ``inject_enable``:
      starts the error generation when a value other than 0 is written.

   All inject variables can be read. Root permission is needed to write them.

   The datasheet states that the error will only be generated after a write
   on an address that matches inject_addrmatch.
   It seems, however, that reading will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2::

       echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
       echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
       echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
       echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
       echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
       dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   For socket 1, replace "mc0" with "mc1" in the above commands.

   The generated error message will look like::

       EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

3) Corrected Error memory register counters

   Those newer MCs have some registers to count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMMs. As the chipset
   offers some counters that also work with UDIMMs (but with a worse level of
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.

   They can be read by looking at the contents of ``all_channel_counts/``::

       $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
       /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
       0
       /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
       0
       /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
       0

   What happens here is that errors on different csrows, but at the same
   dimm number will increment the same counter.
   So, in this memory mapping::

       csrow0: channel 0, dimm0
       csrow1: channel 0, dimm1
       csrow2: channel 1, dimm0
       csrow3: channel 2, dimm0

   The hardware will increment udimm0 for an error at the first dimm at either
   csrow0, csrow2 or csrow3;

   The hardware will increment udimm1 for an error at the second dimm at either
   csrow0, csrow2 or csrow3;

   The hardware will increment udimm2 for an error at the third dimm at either
   csrow0, csrow2 or csrow3.

4) Standard error counters

   The standard error counters are generated when an mcelog error is received
   by the driver. Since, with UDIMMs, this is counted by software, it is
   possible that some errors could be lost. With RDIMMs, they display the
   contents of the registers.

Reference documents used on ``amd64_edac``
------------------------------------------

``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. :Title:  BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
      Opteron Processors
   :AMD publication #: 26094
   :Revision: 3.26
   :Link: http://support.amd.com/TechDocs/26094.PDF

2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
      Processors
   :AMD publication #: 32559
   :Revision: 3.00
   :Issue Date: May 2006
   :Link: http://support.amd.com/TechDocs/32559.pdf

3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
      Processors
   :AMD publication #: 31116
   :Revision: 3.00
   :Issue Date: September 07, 2007
   :Link: http://support.amd.com/TechDocs/31116.pdf

4.
   :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
      Models 30h-3Fh Processors
   :AMD publication #: 49125
   :Revision: 3.06
   :Issue Date: 2/12/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
      Models 60h-6Fh Processors
   :AMD publication #: 50742
   :Revision: 3.01
   :Issue Date: 7/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
      Models 00h-0Fh Processors
   :AMD publication #: 48751
   :Revision: 3.03
   :Issue Date: 2/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

Credits
=======

* Written by Doug Thompson <dougthompson@xmission.com>

  - 7 Dec 2005
  - 17 Jul 2007 Updated

* |copy| Mauro Carvalho Chehab

  - 05 Aug 2009 Nehalem interface
  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section

* EDAC authors/maintainers:

  - Doug Thompson, Dave Jiang, Dave Peterson et al,
  - Mauro Carvalho Chehab
  - Borislav Petkov
  - original author: Thayne Harbaugh