Documentation/admin-guide/ras.rst

33 -------------
44 * Memory – add error correction logic (ECC) to detect and correct errors;
47   Self-Monitoring, Analysis and Reporting Technology (SMART).
49 By monitoring the number of occurrences of error detections, it is possible
55 ---------------
58 Codes that allow error correction when the number of errors on a bit packet
60 can indicate with a high degree of confidence that an error happened, but
63 Also, sometimes an error occurs on a component that is not in use. For
68 * **Correctable Error (CE)** - the error detection mechanism detected and
69   corrected the error. Such errors are usually not fatal, although some
72 * **Uncorrected Error (UE)** - the number of errors exceeded the error
73   correction threshold, and the system was unable to auto-correct.
75 * **Fatal Error** - when a UE happens on a critical component of the
79 * **Non-fatal Error** - when a UE happens on an unused component,
84   When an error happens in a userspace process, it is also possible to
87 The mechanism for handling non-fatal errors is usually complex and may
92 ------------------------------------
98 So, it requires not only error logging facilities, but also mechanisms that
99 will translate the error message to the silkscreen or component label for
113 		Locator: ChannelA-DIMM0
121 In the above example, a DDR4 SO-DIMM memory module is located at the
124 *data width*. This means that the memory module has no error
125 detection/correction mechanisms.
132 		Error Information Handle: Not Provided
149 it has 8 extra bits to be used by error detection and correction mechanisms.
150 This kind of memory is called error-correcting code memory (ECC memory).
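
The comparison between *total width* and *data width* can be automated. The
following is a minimal sketch, assuming the text comes from
``dmidecode -t memory`` and that each memory device reports ``Total Width``
and ``Data Width`` in bits. The function name ``ecc_bits`` and the locator
``DIMM_A1`` in the sample are hypothetical and used only for illustration.

```python
import re

def ecc_bits(dmidecode_text):
    """Return a list of (total, data, extra) widths in bits, one per device.

    Devices whose widths are not numeric (e.g. "Unknown") are skipped.
    """
    results = []
    total = data = None
    for line in dmidecode_text.splitlines():
        if line.strip() == "Memory Device":
            # A new record starts: forget any half-read widths.
            total = data = None
            continue
        m = re.match(r"\s*(Total|Data) Width:\s*(\d+) bits", line)
        if not m:
            continue
        if m.group(1) == "Total":
            total = int(m.group(2))
        else:
            data = int(m.group(2))
        if total is not None and data is not None:
            results.append((total, data, total - data))
            total = data = None
    return results

# Hypothetical sample input, shaped like dmidecode output.
sample = """
Memory Device
	Total Width: 72 bits
	Data Width: 64 bits
	Locator: DIMM_A1
Memory Device
	Total Width: 64 bits
	Data Width: 64 bits
	Locator: ChannelA-DIMM0
"""

for total, data, extra in ecc_bits(sample):
    kind = "ECC" if extra > 0 else "non-ECC"
    print(f"total={total} data={data} extra={extra} -> {kind}")
```

A module whose total width exceeds its data width has spare bits available
for error detection and correction; equal widths indicate a non-ECC module.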
157 ----------
160 used for error correction. In the above example, a memory module has
162 bits which are used for the error detection and correction mechanisms
167 using Hamming code, or some other error correction code, like SECDED+,
176 there was an error, and if the ECC code was able to fix that error.
177 If the error was corrected, a Corrected Error (CE) happened. If not, an
178 Uncorrected Error (UE) happened.
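
The CE/UE distinction can be illustrated with a toy code. The sketch below
uses an extended Hamming (8,4) code, which corrects any single-bit error and
detects any double-bit error (SECDED). It is only an illustration: real
memory controllers work on much wider words and may use different codes, and
the function names ``encode`` and ``decode`` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indexes 1..7 hold a Hamming(7,4)
    codeword with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1             # make the overall parity even
    return code

def decode(code):
    """Return (status, nibble), where status is "OK", "CE" or "UE"."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of set bits
    overall = sum(code) & 1
    if syndrome == 0 and overall == 0:
        status = "OK"
    elif overall == 1:
        # Odd overall parity: a single bit flipped, at index `syndrome`
        # (index 0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "UE", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)
print(decode(word))                         # no error

one_flip = list(word)
one_flip[5] ^= 1
print(decode(one_flip))                     # corrected error (CE)

two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))                    # uncorrected error (UE)
```

With one flipped bit the decoder recovers the original data and reports a
CE; with two flipped bits it can only report that the data is bad, which
corresponds to a UE.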
187   mode called "Lock-Step", where it groups two memory modules together,
188   doing 128-bit reads/writes. That gives 16 bits for error correction, which
189   significantly improves the error correction mechanism, at the expense
190   that, when an error happens, there's no way to know which memory module is
196   identical data. In such a configuration, when an error happens, there's no
198   memory modules (or 4 memory modules, if the system is also in Lock-Step
204 EDAC - Error Detection And Correction
210    was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
218 -------
224 ------
240 -----------------------
245 This new device type allows for non-memory types of ECC hardware detectors
257 ----------------
263 There are several add-in adapters that do **not** follow the PCI specification
270 the EDAC PCI scanning code. If that attribute is set, PCI parity/error
280 ----------
293 -------
298 hardware-specific modules and have the dependencies load the necessary
310 ---------------
325 ----------------------------
328 are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
331 .. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
333   packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
335   Platform Error Record (CPER) section to be an SMBIOS Memory Device
346 for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
349 	+------------+-----------------------+
351 	+------------+-----------+-----------+
355 	+------------+-----------+-----------+
357 	+------------+-----------+-----------+
359 	+------------+-----------+-----------+
361 	+------------+-----------+-----------+
363 	+------------+-----------+-----------+
365 	+------------+-----------+-----------+
370 	+---------+---------+
372 	+---------+---------+
374 	+---------+---------+
376 Labels for these slots are usually silk-screened on the motherboard.
400 		   |->mc0
401 		   |->mc1
402 		   |->mc2
410 		|->csrow0
411 		|->csrow2
412 		|->csrow3
417 order to have dual-channel mode be operational. Since both csrow2 and
425 -------------------
432 	Documentation/ABI/testing/sysfs-devices-edac
436 ----------------------------------
494 - ``size`` - Total memory managed by this csrow attribute file
499 - ``dimm_ue_count`` - Uncorrectable Errors count attribute file
506 - ``dimm_ce_count`` - Correctable Errors count attribute file
512 	monitored for non-zero values and report such information
515 - ``dimm_dev_type``  - Device type attribute file
521 		- x1
522 		- x2
523 		- x4
524 		- x8
526 - ``dimm_edac_mode`` - EDAC Mode of operation attribute file
528 	This attribute file displays the type of error detection
529 	and correction in use.
531 - ``dimm_label`` - memory module label control file
545 - ``dimm_location`` - location of the memory module
552 		- *csrow* and *channel* - used when the memory controller
553 		  doesn't identify a single DIMM - e.g. in ``rankX`` dir;
554 		- *branch*, *channel*, *slot* - typically used on FB-DIMM memory
556 		- *channel*, *slot* - used on Nehalem and newer Intel drivers.
558 - ``dimm_mem_type`` - Memory Type attribute file
564 		- Registered-DDR
565 		- Unbuffered-DDR
577 ----------------------
580 directories. As this API doesn't work properly for Rambus, FB-DIMMs and
588 - ``ue_count`` - Total Uncorrectable Errors count attribute file
596 - ``ce_count`` - Total Correctable Errors count attribute file
602 	monitored for non-zero values and report such information
606 - ``size_mb`` - Total memory managed by this csrow attribute file
612 - ``mem_type`` - Memory Type attribute file
618 		- Registered-DDR
619 		- Unbuffered-DDR
622 - ``edac_mode`` - EDAC Mode of operation attribute file
624 	This attribute file displays the type of error detection
625 	and correction in use.
628 - ``dev_type`` - Device type attribute file
634 		- x1
635 		- x2
636 		- x4
637 		- x8
640 - ``ch0_ce_count`` - Channel 0 CE Count attribute file
646 - ``ch0_ue_count`` - Channel 0 UE Count attribute file
652 - ``ch0_dimm_label`` - Channel 0 DIMM Label control file
668 - ``ch1_ce_count`` - Channel 1 CE Count attribute file
675 - ``ch1_ue_count`` - Channel 1 UE Count attribute file
682 - ``ch1_dimm_label`` - Channel 1 DIMM Label control file
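
The per-csrow counters described above can be collected from userspace. The
following is a minimal sketch, assuming the counters are exposed as
``mcX/csrowY/ce_count`` and ``mcX/csrowY/ue_count`` under
``/sys/devices/system/edac/mc``. The function name ``read_edac_counts`` is
hypothetical, and the root directory is a parameter so the code can be tried
against a copy of the tree.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mcX/csrowY": {"ce_count": n, "ue_count": n}} for each csrow.

    Counter files that are missing are simply left out of the result.
    """
    counts = {}
    for csrow in sorted(Path(root).glob("mc[0-9]*/csrow[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = csrow / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[f"{csrow.parent.name}/{csrow.name}"] = entry
    return counts

def nonzero(counts):
    """Return only the csrows that have at least one non-zero counter."""
    return {key: entry for key, entry in counts.items()
            if any(value > 0 for value in entry.values())}
```

A monitoring script could call ``nonzero(read_edac_counts())`` periodically
and report any csrow that appears in the result.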
698 --------------
709 	+---------------------------------------+-------------+
713 	+---------------------------------------+-------------+
714 	| Error type                            | CE          |
715 	+---------------------------------------+-------------+
717 	+---------------------------------------+-------------+
719 	+---------------------------------------+-------------+
721 	| or resolution of the error            |             |
722 	+---------------------------------------+-------------+
723 	| The error syndrome                    | 0xb741      |
724 	+---------------------------------------+-------------+
726 	+---------------------------------------+-------------+
728 	+---------------------------------------+-------------+
730 	+---------------------------------------+-------------+
731 	| And then an optional, driver-specific |             |
734 	+---------------------------------------+-------------+
736 Both UEs and CEs with no info will lack all but memory controller, error
737 type, a notice of "no info" and then an optional, driver-specific error
742 ------------------------
745 parity error regardless of whether parity is enabled on the device or
752 -------------------
758 - ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
773 - ``pci_parity_count`` - Parity Count
780 -----------------
782 - ``edac_mc_panic_on_ue`` - Panic on UE control file
784 	An uncorrectable error will cause a machine panic.  This is usually
785 	desirable.  It is a bad idea to continue when an uncorrectable error
786 	occurs - it is indeterminate what was uncorrected and the operating
800 - ``edac_mc_log_ue`` - Log UE control file
816 - ``edac_mc_log_ce`` - Log CE control file
832 - ``edac_mc_poll_msec`` - Polling period control file
835 	The time period, in milliseconds, for polling for error information.
838 	locating the error.  1000 milliseconds (once each second) is the current
851 - ``panic_on_pci_parity`` - Panic on PCI PARITY Error
855 	error has been detected.
873 ----------------
887 	/sys/devices/system/edac/test-instance
909 			One out-of-tree driver uses controls here to allow
910 			for ERROR INJECTION operations to hardware
917 ---------
922 	+----------------+
923 	| test-instance0 |
924 	+----------------+
936 ------
941 	+-------------+
942 	| test-block0 |
943 	+-------------+
958 	test-block-bits-0	for every POLL cycle this counter
960 	test-block-bits-1	every 10 cycles, this counter is bumped once,
961 				and test-block-bits-0 is set to 0
962 	test-block-bits-2	every 100 cycles, this counter is bumped once,
963 				and test-block-bits-1 is set to 0
964 	test-block-bits-3	every 1000 cycles, this counter is bumped once,
965 				and test-block-bits-2 is set to 0
970 	reset-counters		writing anything to this control will
983 --------------------------------------------------
1034    implement this functionality via some error injection nodes:
1036    For injecting a memory error, there are some sysfs nodes, under
1039    - ``inject_addrmatch/*``:
1040       Controls the error injection mask register. It is possible to specify
1041       several characteristics of the address to match an error code::
1045          channel = the channel that will generate an error;
1054       For example, to generate an error at rank 1 of dimm 2, for any channel,
1065    - ``inject_eccmask``:
1068    - ``inject_section``:
1069        specifies what ECC cache section will get the error::
1075    - ``inject_type``:
1076        specifies the type of error, being a combination of the following bits::
1078 		bit 0 - repeat
1079 		bit 1 - ecc
1080 		bit 2 - parity
1082    - ``inject_enable``:
1083        starts error generation when a nonzero value is written.
1087    The datasheet states that the error will only be generated after a write to an
1089    also produce an error.
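
Since ``inject_type`` is a combination of bits, the value to write can be
computed rather than memorized. The sketch below assumes only the bit layout
listed above (bit 0 repeat, bit 1 ecc, bit 2 parity). The constant and
function names are hypothetical, and the code does not write to sysfs.

```python
# Bit positions as listed for inject_type; the names are illustrative only.
INJECT_REPEAT = 1 << 0
INJECT_ECC = 1 << 1
INJECT_PARITY = 1 << 2

def inject_type_value(repeat=False, ecc=False, parity=False):
    """Combine the requested error types into a single integer value."""
    value = 0
    if repeat:
        value |= INJECT_REPEAT
    if ecc:
        value |= INJECT_ECC
    if parity:
        value |= INJECT_PARITY
    return value

print(inject_type_value(repeat=True, ecc=True))   # prints 3
```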
1091    For example, the following code will generate an error for any write access
1104    The generated error message will look like::
1106 …-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome…
1108 3) Corrected Error memory register counters
1138    The hardware will increment udimm0 for an error at the first dimm at either
1141    The hardware will increment udimm1 for an error at the second dimm at either
1144    The hardware will increment udimm2 for an error at the third dimm at either
1147 4) Standard error counters
1149    The standard error counters are generated when an mcelog error is received
1155 ------------------------------------------
1158 (available from http://support.amd.com/en-us/search/tech-docs):
1181 	  Models 30h-3Fh Processors
1185    :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
1188 	  Models 60h-6Fh Processors
1192    :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
1195 	  Models 00h-0Fh Processors
1206   - 7 Dec 2005
1207   - 17 Jul 2007	Updated
1211   - 05 Aug 2009	Nehalem interface
1212   - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section
1216   - Doug Thompson, Dave Jiang, Dave Peterson et al,
1217   - Mauro Carvalho Chehab
1218   - Borislav Petkov
1219   - original author: Thayne Harbaugh