.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================

RAS concepts
************

Reliability, Availability and Serviceability (RAS) is a concept used on
servers to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time.

  * Generally measured as a percentage of downtime per period of time
  * Often uses mechanisms to detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained.

  * Generally measured as Mean Time Between Repair (MTBR)

Improving RAS
-------------

In order to reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime. It should
also provide mechanisms to detect hardware degradation, in order to warn
the system administrator to replace a component before it causes data loss
or system downtime.

Among the monitoring measures, the most usual ones include:

* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
* Memory – add error correction logic (ECC) to detect and correct errors;
* I/O – add CRC checksums for transferred data;
* Storage – RAID, journal file systems, checksums,
  Self-Monitoring, Analysis and Reporting Technology (SMART).

By monitoring the number of occurrences of error detections, it is possible
to identify if the probability of hardware errors is increasing and, in such
case, do preventive maintenance to replace a degraded component while
those errors are still correctable.

Types of errors
---------------

Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors on a bit packet
is below a threshold. If the number of errors is above the threshold, those
mechanisms can indicate with a high degree of confidence that an error
happened, but they can't correct it.

Also, sometimes an error occurs on a component that is not in use. For
example, a part of the memory that is not currently allocated.

That defines some categories of errors:

* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors was above the error
  correction threshold, and the system was unable to auto-correct.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, eventually replacing the affected hardware by a hot spare,
  if available.

  Also, when an error happens on a userspace process, it is also possible to
  kill such process and let userspace restart it.

The mechanism for handling non-fatal errors is usually complex and may
require the help of some userspace application, in order to apply the
policy desired by the system administrator.

Identifying a bad hardware component
------------------------------------

Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
will translate the error message to the silkscreen or component label for
the MRU.

Typically, it is very complex for memory, as modern CPUs interleave memory
from different memory modules, in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::

        Memory Device
                Total Width: 64 bits
                Data Width: 64 bits
                Size: 16384 MB
                Form Factor: SODIMM
                Set: None
                Locator: ChannelA-DIMM0
                Bank Locator: BANK 0
                Type: DDR4
                Type Detail: Synchronous
                Speed: 2133 MHz
                Rank: 2
                Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is located at the
system's memory labeled as "BANK 0", as given by the *bank locator* field.
Please notice that, on such system, the *total width* is equal to the
*data width*. It means that such memory module doesn't have error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank.
In this example, from an older server, ``dmidecode`` shows::

        Memory Device
                Array Handle: 0x1000
                Error Information Handle: Not Provided
                Total Width: 72 bits
                Data Width: 64 bits
                Size: 8192 MB
                Form Factor: DIMM
                Set: 1
                Locator: DIMM_A1
                Bank Locator: Not Specified
                Type: DDR3
                Type Detail: Synchronous Registered (Buffered)
                Speed: 1600 MHz
                Rank: 2
                Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is located at the system's memory labeled
as "DIMM_A1", as given by the *locator* field. Please notice that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
Such kind of memory is called Error-correcting code memory (ECC memory).

To make things even worse, it is not uncommon for systems with different
labels on their system boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.

ECC memory
----------

As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. So, on 64 bit systems, a memory module
has 64 bits of *data width*, and 72 bits of *total width*. So, there are
8 extra bits to be used for the error detection and correction
mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
on the memory modules.

At read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened. If not, an
Uncorrected Error (UE) happened.

The information about the CE/UE errors is stored on some special registers
at the memory controller and can be accessed by reading such registers,
either by BIOS, by some special CPUs or by the Linux EDAC driver. On x86 64
bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.

.. [#f1] Please notice that several memory controllers allow operation on a
  mode called "Lock-Step", where it groups two memory modules together,
  doing 128-bit reads/writes. That gives 16 bits for error correction, which
  significantly improves the error correction mechanism, at the expense
  that, when an error happens, there's no way to know what memory module is
  to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
  On such mode, the same data is written to two memory modules. At read,
  the system checks both memory modules, in order to check if both provide
  identical data. On such configuration, when an error happens, there's no
  way to know what memory module is to blame. So, it has to blame both
  memory modules (or 4 memory modules, if the system is also on Lock-step
  mode).

.. [#f3] For more details about the Machine Check Architecture (MCA),
  please read Documentation/x86/x86_64/machinecheck at the Kernel tree.

EDAC - Error Detection And Correction
*************************************

.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was pushed upstream for the first time, on
   Kernel 2.6.16, it was renamed to ``EDAC``.

Purpose
-------

The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

Memory
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the ``edac_mc`` device.

Detecting CE events, then harvesting those events and reporting them,
**can**, but need not necessarily, be a predictor of future UE events. With
CE events only, the system can and will continue to operate as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
modules exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

Other hardware elements
-----------------------

A new feature for EDAC, the ``edac_device`` class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory type of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.


PCI bus scanning
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float" giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device. The attribute is::

        broken_parity_status

and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
PCI devices.


Versioning
----------

EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
``edac_device`` driver) have individual versions that reflect current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


Loading
-------

If ``edac`` was statically linked with the kernel then no loading
is necessary. If ``edac`` was built as modules then simply modprobe
the ``edac`` pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example::

        $ modprobe amd76x_edac

loads both the ``amd76x_edac.ko`` memory controller module and the
``edac_mc.ko`` core module.


Sysfs interface
---------------

EDAC presents a ``sysfs`` interface for control and reporting purposes. It
lives in the ``/sys/devices/system/edac`` directory.

Within this directory there currently reside 2 components:

    =======  ==============================
    mc       memory controller(s) system
    pci      PCI control and status system
    =======  ==============================



Memory Controller (mc) Model
----------------------------

Each ``mc`` device controls a set of memory modules [#f4]_. These modules
are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
There can be multiple csrows and multiple channels.

.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
  used to refer to a memory module, although there are other memory
  packaging alternatives, like SO-DIMM, SIMM, etc. Throughout this document,
  and inside the EDAC system, the term "dimm" is used for all memory
  modules, even when they use a different kind of packaging.

Memory controllers allow for several csrows, with 8 csrows being a
typical value. Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and memory module characteristics.

Dual channels allow for dual data length (e.g. 128 bits, on 64 bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers.
The following example will assume 2 channels:

    +------------+-----------------------+
    | Chip       |       Channels        |
    | Select     +-----------+-----------+
    | rows       |  ``ch0``  |  ``ch1``  |
    +============+===========+===========+
    | ``csrow0`` |  DIMM_A0  |  DIMM_B0  |
    +------------+           |           |
    | ``csrow1`` |           |           |
    +------------+-----------+-----------+
    | ``csrow2`` |  DIMM_A1  |  DIMM_B1  |
    +------------+           |           |
    | ``csrow3`` |           |           |
    +------------+-----------+-----------+

In the above example, there are 4 physical slots on the motherboard
for memory DIMMs:

    +---------+---------+
    | DIMM_A0 | DIMM_B0 |
    +---------+---------+
    | DIMM_A1 | DIMM_B1 |
    +---------+---------+

Labels for these slots are usually silk-screened on the motherboard.
Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
will have just one csrow (csrow0). csrow1 will be empty. On the other
hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0
and csrow1 will be populated. The pattern repeats itself for csrow2 and
csrow3.

The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
``/sys/devices/system/edac/mc``, each memory controller will be
represented by its own ``mcX`` directory, where ``X`` is the
index of the MC::

        ..../edac/mc/
                   |
                   |->mc0
                   |->mc1
                   |->mc2
                   ....

Under each ``mcX`` directory each ``csrowX`` is again represented by a
``csrowX``, where ``X`` is the csrow index::

        .../mc/mc0/
                |
                |->csrow0
                |->csrow2
                |->csrow3
                ....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.

Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.

``mcX`` directories
-------------------

In ``mcX`` directories are EDAC control and attribute files for
this ``X`` instance of the memory controllers.

For a description of the sysfs API, please see:

        Documentation/ABI/testing/sysfs-devices-edac


``dimmX`` or ``rankX`` directories
----------------------------------

The recommended way to use the EDAC subsystem is to look at the information
provided by the ``dimmX`` or ``rankX`` directories [#f5]_.

A typical EDAC system has the following structure under
``/sys/devices/system/edac/``\ [#f6]_::

        /sys/devices/system/edac/
        ├── mc
        │   ├── mc0
        │   │   ├── ce_count
        │   │   ├── ce_noinfo_count
        │   │   ├── dimm0
        │   │   │   ├── dimm_dev_type
        │   │   │   ├── dimm_edac_mode
        │   │   │   ├── dimm_label
        │   │   │   ├── dimm_location
        │   │   │   ├── dimm_mem_type
        │   │   │   ├── size
        │   │   │   └── uevent
        │   │   ├── max_location
        │   │   ├── mc_name
        │   │   ├── reset_counters
        │   │   ├── seconds_since_reset
        │   │   ├── size_mb
        │   │   ├── ue_count
        │   │   ├── ue_noinfo_count
        │   │   └── uevent
        │   ├── mc1
        │   │   ├── ce_count
        │   │   ├── ce_noinfo_count
        │   │   ├── dimm0
        │   │   │   ├── dimm_dev_type
        │   │   │   ├── dimm_edac_mode
        │   │   │   ├── dimm_label
        │   │   │   ├── dimm_location
        │   │   │   ├── dimm_mem_type
        │   │   │   ├── size
        │   │   │   └── uevent
        │   │   ├── max_location
        │   │   ├── mc_name
        │   │   ├── reset_counters
        │   │   ├── seconds_since_reset
        │   │   ├── size_mb
        │   │   ├── ue_count
        │   │   ├── ue_noinfo_count
        │   │   └── uevent
        │   └── uevent
        └── uevent

In the ``dimmX`` directories are EDAC control and attribute files for
this ``X`` memory module:

- ``size`` - Total memory managed by this memory module attribute file

  This attribute file displays, in count of megabytes, the memory
  that this memory module contains.

- ``dimm_dev_type`` - Device type attribute file

  This attribute file will display what type of DRAM device is
  being utilized on this DIMM.
  Examples:

  - x1
  - x2
  - x4
  - x8

- ``dimm_edac_mode`` - EDAC Mode of operation attribute file

  This attribute file will display what type of Error detection
  and correction is being utilized.

- ``dimm_label`` - memory module label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM Labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.

- ``dimm_location`` - location of the memory module

  The location can have up to 3 levels, and describes how the
  memory controller identifies the location of a memory module.
  Depending on the type of memory and memory controller, it
  can be:

  - *csrow* and *channel* - used when the memory controller
    doesn't identify a single DIMM - e.g. in ``rankX`` dir;
  - *branch*, *channel*, *slot* - typically used on FB-DIMM memory
    controllers;
  - *channel*, *slot* - used on Nehalem and newer Intel drivers.

- ``dimm_mem_type`` - Memory Type attribute file

  This attribute file will display what type of memory is currently
  on this memory module. Normally, either buffered or unbuffered memory.
  Examples:

  - Registered-DDR
  - Unbuffered-DDR

.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is
  called ``rankX`` and works in a similar way to the ``csrowX`` directories.
  On modern Intel memory controllers, the memory controller identifies the
  memory modules directly. On such systems, the directory is called ``dimmX``.

.. [#f6] There are also some ``power`` directories and ``subsystem``
  symlinks inside the sysfs mapping that are automatically created by
  the sysfs subsystem. Currently, they serve no purpose.

``csrowX`` directories
----------------------

When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
directories.
As this API doesn't work properly for Rambus, FB-DIMMs and
modern Intel Memory Controllers, this is being deprecated in favor of
``dimmX`` directories.

In the ``csrowX`` directories are EDAC control and attribute files for
this ``X`` instance of csrow:


- ``ue_count`` - Total Uncorrectable Errors count attribute file

  This attribute file displays the total count of uncorrectable
  errors that have occurred on this csrow. If panic_on_ue is set
  this counter will not have a chance to increment, since EDAC
  will panic the system.


- ``ce_count`` - Total Correctable Errors count attribute file

  This attribute file displays the total count of correctable
  errors that have occurred on this csrow. This count is very
  important to examine. CEs provide early indications that a
  DIMM is beginning to fail. This count field should be
  monitored for non-zero values, and such information reported
  to the system administrator.


- ``size_mb`` - Total memory managed by this csrow attribute file

  This attribute file displays, in count of megabytes, the memory
  that this csrow contains.


- ``mem_type`` - Memory Type attribute file

  This attribute file will display what type of memory is currently
  on this csrow. Normally, either buffered or unbuffered memory.
  Examples:

  - Registered-DDR
  - Unbuffered-DDR


- ``edac_mode`` - EDAC Mode of operation attribute file

  This attribute file will display what type of Error detection
  and correction is being utilized.


- ``dev_type`` - Device type attribute file

  This attribute file will display what type of DRAM device is
  being utilized on this DIMM.
  Examples:

  - x1
  - x2
  - x4
  - x8


- ``ch0_ce_count`` - Channel 0 CE Count attribute file

  This attribute file will display the count of CEs on this
  DIMM located in channel 0.


- ``ch0_ue_count`` - Channel 0 UE Count attribute file

  This attribute file will display the count of UEs on this
  DIMM located in channel 0.


- ``ch0_dimm_label`` - Channel 0 DIMM Label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM Labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.


- ``ch1_ce_count`` - Channel 1 CE Count attribute file

  This attribute file will display the count of CEs on this
  DIMM located in channel 1.


- ``ch1_ue_count`` - Channel 1 UE Count attribute file

  This attribute file will display the count of UEs on this
  DIMM located in channel 1.


- ``ch1_dimm_label`` - Channel 1 DIMM Label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM Labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.
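
The ``ce_count`` and ``ue_count`` files can be polled from userspace. The
following is a minimal sketch, not part of EDAC itself: the function name
``edac_report_counts`` and its output format are hypothetical, and the
optional argument exists only so that the function can be pointed at a copy
of the tree. It assumes the ``mcX`` layout described in this document;
``csrowX`` entries are only reported when those directories exist::

        #!/bin/sh
        # Print the CE/UE counters of every memory controller and, when the
        # legacy csrowX directories are present, of every csrow.
        # $1 (optional): base directory, default /sys/devices/system/edac/mc
        edac_report_counts() {
            base="${1:-/sys/devices/system/edac/mc}"
            for dir in "$base"/mc[0-9]* "$base"/mc[0-9]*/csrow[0-9]*; do
                # An unmatched glob is left as-is, so skip non-directories.
                [ -d "$dir" ] || continue
                ce=$(cat "$dir/ce_count" 2>/dev/null) || ce="?"
                ue=$(cat "$dir/ue_count" 2>/dev/null) || ue="?"
                # Strip the base directory to keep the output short.
                echo "${dir#"$base"/}: ce_count=$ce ue_count=$ue"
            done
        }

Calling ``edac_report_counts`` without arguments reads the live sysfs tree.
A monitoring script could compare successive outputs and alert the system
administrator when a ``ce_count`` value grows.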


System Logging
--------------

If logging for UEs and CEs is enabled, then system logs will contain
information indicating that errors have been detected::

  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac


The structure of the message, using the first line above as the example, is:

    +---------------------------------------+-------------+
    | Content                               | Example     |
    +=======================================+=============+
    | The memory controller                 | MC0         |
    +---------------------------------------+-------------+
    | Error type                            | CE          |
    +---------------------------------------+-------------+
    | Memory page                           | 0x283       |
    +---------------------------------------+-------------+
    | Offset in the page                    | 0xce0       |
    +---------------------------------------+-------------+
    | The byte granularity                  | grain 8     |
    | or resolution of the error            |             |
    +---------------------------------------+-------------+
    | The error syndrome                    | 0x6ec3      |
    +---------------------------------------+-------------+
    | Memory row                            | row 0       |
    +---------------------------------------+-------------+
    | Memory channel                        | channel 1   |
    +---------------------------------------+-------------+
    | DIMM label, if set prior              | DIMM_B1     |
    +---------------------------------------+-------------+
    | And then an optional, driver-specific |             |
    | message that may have additional      |             |
    | information.                          |             |
    +---------------------------------------+-------------+

Both UEs and CEs with no info will lack all but memory controller, error
type, a notice of "no info" and then an optional, driver-specific error
message.
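
As an illustration of the message structure described in this section, the
fields can be extracted with ``sed``. This is a sketch only:
``edac_parse_line`` is a hypothetical helper, and its pattern is written
against the sample messages in this section. Messages with "no info", or
messages formatted differently by other kernel versions, will simply not
match::

        #!/bin/sh
        # Print the fields of one EDAC MC log line as key=value pairs.
        # Prints nothing when the line does not match the sample format.
        # Any prefix before "EDAC" (e.g. a timestamp) is ignored.
        edac_parse_line() {
            printf '%s\n' "$1" | sed -n -E \
                's/^.*EDAC (MC[0-9]+): (CE|UE) page (0x[0-9a-f]+), offset (0x[0-9a-f]+), grain ([0-9]+), syndrome (0x[0-9a-f]+), row ([0-9]+), channel ([0-9]+) "([^"]*)": .*$/mc=\1 type=\2 page=\3 offset=\4 grain=\5 syndrome=\6 row=\7 channel=\8 label=\9/p'
        }

Each matching line yields a single summary such as ``mc=MC0 type=CE ...``,
which is easier to aggregate per DIMM label than the raw message.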


PCI Bus Parity Detection
------------------------

On Header Type 00 devices, the primary status is looked at for any
parity error regardless of whether parity is enabled on the device or
not. (The spec indicates parity is generated in some cases). On Header
Type 01 bridges, the secondary status register is also looked at to see
if a parity error occurred on the bus on the other side of the bridge.


Sysfs configuration
-------------------

Under ``/sys/devices/system/edac/pci`` are control and attribute files as
follows:


- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file

  This control file enables or disables the PCI Bus Parity scanning
  operation. Writing a 1 to this file enables the scanning. Writing
  a 0 to this file disables the scanning.

  Enable::

        echo "1" >/sys/devices/system/edac/pci/check_pci_parity

  Disable::

        echo "0" >/sys/devices/system/edac/pci/check_pci_parity


- ``pci_parity_count`` - Parity Count

  This attribute file will display the number of parity errors that
  have been detected.


Module parameters
-----------------

- ``edac_mc_panic_on_ue`` - Panic on UE control file

  An uncorrectable error will cause a machine panic. This is usually
  desirable. It is a bad idea to continue when an uncorrectable error
  occurs - it is indeterminate what was uncorrected and the operating
  system context might be so mangled that continuing will lead to further
  corruption. If the kernel has MCE configured, then EDAC will never
  notice the UE.

  LOAD TIME::

        module/kernel parameter: edac_mc_panic_on_ue=[0|1]

  RUN TIME::

        echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue


- ``edac_mc_log_ue`` - Log UE control file

  Generate kernel messages describing uncorrectable errors.
  These errors are reported through the system message log. UE statistics
  will be accumulated even when UE logging is disabled.

  LOAD TIME::

        module/kernel parameter: edac_mc_log_ue=[0|1]

  RUN TIME::

        echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue


- ``edac_mc_log_ce`` - Log CE control file

  Generate kernel messages describing correctable errors. These
  errors are reported through the system message log.
  CE statistics will be accumulated even when CE logging is disabled.

  LOAD TIME::

        module/kernel parameter: edac_mc_log_ce=[0|1]

  RUN TIME::

        echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce


- ``edac_mc_poll_msec`` - Polling period control file

  The time period, in milliseconds, for polling for error information.
  Too small a value wastes resources. Too large a value might delay
  necessary handling of errors and might lose valuable information for
  locating the error. 1000 milliseconds (once each second) is the current
  default. Systems which require all the bandwidth they can get may
  increase this.

  LOAD TIME::

        module/kernel parameter: edac_mc_poll_msec=<milliseconds>

  RUN TIME::

        echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec


- ``panic_on_pci_parity`` - Panic on PCI PARITY Error

  This control file enables or disables panicking when a parity
  error has been detected.

  module/kernel parameter::

        edac_panic_on_pci_pe=[0|1]

  Enable::

        echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe

  Disable::

        echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe



EDAC device type
----------------

In the header file, edac_pci.h, there is a series of edac_device structures
and APIs for the EDAC_DEVICE.

User space access to an edac_device is through the sysfs interface.

At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
will appear.

There is a three level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
website) installs itself as::

        /sys/devices/system/edac/test-instance

In this directory are various controls, a symlink and one or more ``instance``
directories.

The standard default controls are:

    ==============  =======================================================
    log_ce          boolean to log CE events
    log_ue          boolean to log UE events
    panic_on_ue     boolean to ``panic`` the system if a UE is encountered
                    (default off, can be set true via startup script)
    poll_msec       time period between POLL cycles for events
    ==============  =======================================================

The test_device_edac device adds at least one custom control of its own:

    ==============  ==================================================
    test_bits       which in the current test driver does nothing but
                    show how it is installed. A ported driver can
                    add one or more such controls and/or attributes
                    for specific uses.
                    One out-of-tree driver uses controls here to allow
                    for ERROR INJECTION operations to hardware
                    injection registers
    ==============  ==================================================

The symlink points to the 'struct dev' that is registered for this edac_device.

Instances
---------

One or more instance directories are present. For the ``test_device_edac``
case:

    +----------------+
    | test-instance0 |
    +----------------+


In this directory there are two default counter attributes, which are totals
of the counters in deeper subdirectories.

  ==============  ====================================
  ce_count        total of CE events of subdirectories
  ue_count        total of UE events of subdirectories
  ==============  ====================================

Blocks
------

At the lowest directory level is the ``block`` directory. There can be 0, 1
or more blocks specified in each instance:

  +-------------+
  | test-block0 |
  +-------------+

In this directory the default attributes are:

  ==============  ================================================
  ce_count        which is counter of CE events for this ``block``
                  of hardware being monitored
  ue_count        which is counter of UE events for this ``block``
                  of hardware being monitored
  ==============  ================================================


The ``test_device_edac`` device adds 4 attributes and 1 control:

  ==================  ====================================================
  test-block-bits-0   for every POLL cycle this counter
                      is incremented
  test-block-bits-1   every 10 cycles, this counter is bumped once,
                      and test-block-bits-0 is set to 0
  test-block-bits-2   every 100 cycles, this counter is bumped once,
                      and test-block-bits-1 is set to 0
  test-block-bits-3   every 1000 cycles, this counter is bumped once,
                      and test-block-bits-2 is set to 0
  ==================  ====================================================


  ==================  ====================================================
  reset-counters      writing ANY thing to this control will
                      reset all the above counters.
  ==================  ====================================================


The ``test_device_edac`` driver should serve as a starting point for others
to create their own drivers for their hardware systems.

The ``test_device_edac`` sample driver is located at the
http://bluesmoke.sourceforge.net project site for EDAC.
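
The three-level layout described above (device, instance, block) can be
walked with a short shell function. This is a minimal sketch and not part of
any EDAC tooling: the function name ``edac_device_counts`` is hypothetical,
and the default path assumes that the ``test_device_edac`` device is present
on the system::

    # Print the CE/UE counters of one edac_device: first the per-instance
    # totals, then the per-block counters.
    # Assumed layout: <device>/<instance>/<block>/{ce_count,ue_count}
    edac_device_counts() {
        # Hypothetical default; pass another device directory as $1.
        dev=${1:-/sys/devices/system/edac/test-instance}
        for f in "$dev"/*/ce_count "$dev"/*/ue_count \
                 "$dev"/*/*/ce_count "$dev"/*/*/ue_count; do
            # A glob that matches nothing stays literal, so skip non-files.
            [ -f "$f" ] || continue
            # Strip the device prefix to get a path relative to the device.
            rel=${f#"$dev"/}
            printf '%s %s\n' "$rel" "$(cat "$f")"
        done
    }

Calling ``edac_device_counts`` without arguments reads the default path.
Each output line holds a counter path relative to the device directory,
followed by its current value.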


Usage of EDAC APIs on Nehalem and newer Intel CPUs
--------------------------------------------------

On older Intel architectures, the memory controller was part of the North
Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Sky Lake and
newer Intel architectures integrated an enhanced version of the memory
controller (MC) inside the CPUs.

This chapter covers the differences of the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``sbx_edac``.

.. note::

   The Xeon E7 processor families use a separate chip for the memory
   controller, called Intel Scalable Memory Buffer. This section doesn't
   apply to such families.

1) There is one Memory Controller per QuickPath Interconnect
   (QPI). In the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API maps the smallest unit to csrows, the driver
   sequentially maps channel/DIMM into different csrows.

   For example, supposing the following layout::

      Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
        dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
        dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
      Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
        dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
      Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
        dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400

   The driver will map it as::

      csrow0: channel 0, dimm0
      csrow1: channel 0, dimm1
      csrow2: channel 1, dimm0
      csrow3: channel 2, dimm0

   The driver exports one DIMM per csrow.

   Each QPI is exported as a different memory controller.

2) The MC has the ability to inject errors to test drivers. The drivers
   implement this functionality via some error injection nodes:

   For injecting a memory error, there are some sysfs nodes, under
   ``/sys/devices/system/edac/mc/mc?/``:

   - ``inject_addrmatch/*``:
      Controls the error injection mask register. It is possible to specify
      several characteristics of the address to match an error code::

         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.

      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to any.

      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column::

         echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
         echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

      To return to the default behaviour of matching any, you can do::

         echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
         echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

   - ``inject_eccmask``:
      specifies which bits will be affected by the error.

   - ``inject_section``:
      specifies what ECC cache section will get the error::

         3 for both
         2 for the highest
         1 for the lowest

   - ``inject_type``:
      specifies the type of error, being a combination of the following bits::

         bit 0 - repeat
         bit 1 - ecc
         bit 2 - parity

   - ``inject_enable``:
      starts the error generation when a value other than 0 is written.

   All inject vars can be read. Root permission is needed to write them.

   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2::

      echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
      echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
      echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
      echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
      echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
      dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   For socket 1, replace "mc0" with "mc1" in the above commands.

   The generated error message will look like::

      EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

3) Corrected Error memory register counters

   Those newer MCs have some registers to count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMMs. As the chipset
   offers some counters that also work with UDIMMs (but with a coarser
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.

   They can be read by looking at the contents of ``all_channel_counts/``::

      $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
      /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
      0
      /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
      0
      /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
      0

   What happens here is that errors on different csrows, but at the same
   dimm number, will increment the same counter.
   So, in this memory mapping::

      csrow0: channel 0, dimm0
      csrow1: channel 0, dimm1
      csrow2: channel 1, dimm0
      csrow3: channel 2, dimm0

   The hardware will increment udimm0 for an error at the first dimm at either
   csrow0, csrow2 or csrow3;

   The hardware will increment udimm1 for an error at the second dimm at either
   csrow0, csrow2 or csrow3;

   The hardware will increment udimm2 for an error at the third dimm at either
   csrow0, csrow2 or csrow3.

4) Standard error counters

   The standard error counters are generated when an mcelog error is received
   by the driver.
   Since, with UDIMM, this is counted by software, it is
   possible that some errors could be lost. With RDIMMs, they display the
   contents of the registers.

Reference documents used on ``amd64_edac``
------------------------------------------

``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
           Opteron Processors
   :AMD publication #: 26094
   :Revision: 3.26
   :Link: http://support.amd.com/TechDocs/26094.PDF

2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
           Processors
   :AMD publication #: 32559
   :Revision: 3.00
   :Issue Date: May 2006
   :Link: http://support.amd.com/TechDocs/32559.pdf

3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
           Processors
   :AMD publication #: 31116
   :Revision: 3.00
   :Issue Date: September 07, 2007
   :Link: http://support.amd.com/TechDocs/31116.pdf

4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
           Models 30h-3Fh Processors
   :AMD publication #: 49125
   :Revision: 3.06
   :Issue Date: 2/12/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
           Models 60h-6Fh Processors
   :AMD publication #: 50742
   :Revision: 3.01
   :Issue Date: 7/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
           Models 00h-0Fh Processors
   :AMD publication #: 48751
   :Revision: 3.03
   :Issue Date: 2/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

Credits
=======

* Written by Doug Thompson <dougthompson@xmission.com>

  - 7 Dec 2005
  - 17 Jul 2007 Updated

* |copy| Mauro Carvalho Chehab

  - 05 Aug 2009 Nehalem interface
  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section

* EDAC authors/maintainers:

  - Doug Thompson, Dave Jiang, Dave Peterson et al,
  - Mauro Carvalho Chehab
  - Borislav Petkov
  - original author: Thayne Harbaugh