.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================

RAS concepts
************

Reliability, Availability and Serviceability (RAS) is a concept used on
servers meant to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time.

  * Generally measured as a percentage of downtime per period of time
  * Often uses mechanisms to detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained.

  * Generally measured as Mean Time Between Repair (MTBR)
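
As a rough worked example of how these metrics relate, the sketch below
computes a steady-state availability figure from a mean time between
failures and a mean repair time. The relation used here is a common textbook
approximation and is not defined by the kernel; the function names and the
sample numbers are assumptions made only for illustration.

.. code-block:: python

    def availability(mtbf_hours, repair_hours):
        """Fraction of time the system is expected to be operational."""
        if mtbf_hours <= 0 or repair_hours < 0:
            raise ValueError("MTBF must be positive and repair time non-negative")
        return mtbf_hours / (mtbf_hours + repair_hours)


    def downtime_hours_per_year(avail, hours_per_year=365 * 24):
        """Expected downtime over one non-leap year for a given availability."""
        return (1.0 - avail) * hours_per_year


    # Illustrative numbers only: one failure every 10000 hours, 2 hours to repair.
    example = availability(10000, 2)

Under these assumptions, shortening the repair time improves availability
just as lengthening the time between failures does, which is why
serviceability is counted as part of RAS.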

Improving RAS
-------------

In order to reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime. It should
also provide mechanisms to detect hardware degradation, in order to warn
the system administrator to replace a component before
it causes data loss or system downtime.

Among the monitoring measures, the most usual ones include:

* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
* Memory – add error correction logic (ECC) to detect and correct errors;
* I/O – add CRC checksums for transferred data;
* Storage – RAID, journal file systems, checksums,
  Self-Monitoring, Analysis and Reporting Technology (SMART).

By monitoring the number of occurrences of error detections, it is possible
to identify if the probability of hardware errors is increasing and, in such
a case, do preventive maintenance to replace a degraded component while
those errors are still correctable.
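
A minimal sketch of such monitoring follows. It assumes that some collector
(not shown) periodically records a timestamp together with a cumulative error
counter. The helper names and the threshold are hypothetical; they only
illustrate the idea of watching the error rate instead of the raw count.

.. code-block:: python

    def error_rate_per_hour(samples):
        """Return per-interval rates from (timestamp_seconds, total_count) samples."""
        rates = []
        for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
            elapsed = t1 - t0
            if elapsed <= 0 or c1 < c0:
                # Skip out-of-order samples and counter resets.
                continue
            rates.append((c1 - c0) * 3600.0 / elapsed)
        return rates


    def is_degrading(samples, threshold_per_hour):
        """True when the latest rate exceeds both the threshold and the prior rate."""
        rates = error_rate_per_hour(samples)
        if len(rates) < 2:
            return False
        return rates[-1] > threshold_per_hour and rates[-1] > rates[-2]

A suitable threshold depends on the hardware and on the policy of the system
administrator; no particular value is implied here.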

Types of errors
---------------

Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors on a bit packet
is below a threshold. If the number of errors is above that threshold, those
mechanisms can indicate with a high degree of confidence that an error
happened, but they can't correct it.

Also, sometimes an error occurs on a component that is not in use, for
example, a part of the memory that is not currently allocated.

That defines some categories of errors:

* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors was above the error
  correction threshold, and the system was unable to auto-correct.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, eventually replacing the affected hardware by a hot spare,
  if available.

  Also, when an error happens on a userspace process, it is also possible to
  kill such process and let userspace restart it.

The mechanism for handling non-fatal errors is usually complex and may
require the help of some userspace application, in order to apply the
policy desired by the system administrator.
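
The categories above can be summarized as a small decision function. This is
only a sketch of the classification described in this section: the type and
function names are hypothetical, and a real policy would take many more
inputs into account.

.. code-block:: python

    from enum import Enum


    class Severity(Enum):
        CORRECTED = "CE"
        FATAL = "fatal UE"
        NON_FATAL = "non-fatal UE"


    def classify(corrected, component_in_use, critical_component):
        """Map an error report to one of the categories described above."""
        if corrected:
            return Severity.CORRECTED
        if component_in_use and critical_component:
            return Severity.FATAL
        # Unused hardware, or a userspace process that can be killed.
        return Severity.NON_FATAL


    def suggested_action(severity):
        """Illustrative policy only; the administrator decides the real one."""
        return {
            Severity.CORRECTED: "log and count",
            Severity.FATAL: "hang or reboot",
            Severity.NON_FATAL: "isolate the component or kill the process",
        }[severity]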

Identifying a bad hardware component
------------------------------------

Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
will translate the error message to the silkscreen or component label for
the MRU.

Typically, it is very complex for memory, as modern CPUs interleave memory
from different memory modules, in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::

	Memory Device
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 16384 MB
		Form Factor: SODIMM
		Set: None
		Locator: ChannelA-DIMM0
		Bank Locator: BANK 0
		Type: DDR4
		Type Detail: Synchronous
		Speed: 2133 MHz
		Rank: 2
		Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is installed in the
location labeled as "BANK 0", as given by the *bank locator* field.
Please notice that, on such system, the *total width* is equal to the
*data width*. It means that such memory module doesn't have error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank. In this example, from an older server, ``dmidecode`` shows::

	Memory Device
		Array Handle: 0x1000
		Error Information Handle: Not Provided
		Total Width: 72 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: 1
		Locator: DIMM_A1
		Bank Locator: Not Specified
		Type: DDR3
		Type Detail: Synchronous Registered (Buffered)
		Speed: 1600 MHz
		Rank: 2
		Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is installed in the location labeled
as "DIMM_A1", as given by the *locator* field. Please notice that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
Such kind of memory is called Error-correcting code memory (ECC memory).
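
The width comparison can be automated. The sketch below parses the text of a
single ``Memory Device`` record, in the format shown in the examples above,
and returns the number of extra bits. The function name and the shortened
sample record are assumptions made for this illustration; the sketch does not
run ``dmidecode`` itself.

.. code-block:: python

    import re


    def ecc_bits(memory_device_text):
        """Return total width minus data width, in bits, or None if unknown."""
        widths = {}
        for line in memory_device_text.splitlines():
            match = re.match(r"\s*(Total|Data) Width:\s*(\d+) bits", line)
            if match:
                widths[match.group(1)] = int(match.group(2))
        if "Total" not in widths or "Data" not in widths:
            return None
        return widths["Total"] - widths["Data"]


    # Shortened copy of the server record shown above.
    SERVER = (
        "Memory Device\n"
        "\tTotal Width: 72 bits\n"
        "\tData Width: 64 bits\n"
        "\tLocator: DIMM_A1\n"
    )

A result of 0 corresponds to the desktop example (no ECC), while the server
record yields 8 extra bits.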

To make things even worse, it is not uncommon for systems with different
labels on their system boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.

ECC memory
----------

As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. In the above example, a memory module has
64 bits of *data width*, and 72 bits of *total width*.  The extra 8
bits which are used for the error detection and correction mechanisms
are referred to as the *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
to the memory modules.

At read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened. If not, an
Uncorrected Error (UE) happened.
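
The write and read paths described above can be illustrated with a toy code.
The sketch below is not what any real memory controller implements: it
protects 4 data bits with 4 check bits (an extended Hamming code, which gives
single error correction and double error detection) instead of the 64 + 8
bits of a real ECC module, and the function names are made up for this
example.

.. code-block:: python

    def encode(nibble):
        """Encode 4 data bits into an 8-bit codeword (list of bits)."""
        data = [(nibble >> i) & 1 for i in range(4)]
        code = [0] * 8                  # index 0 holds the overall parity bit
        code[3], code[5], code[6], code[7] = data
        for p in (1, 2, 4):
            # Check bit p covers every position whose index has bit p set.
            code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
        code[0] = sum(code[1:]) % 2
        return code


    def decode(code):
        """Return (nibble, status), where status is 'ok', 'CE' or 'UE'."""
        code = list(code)
        syndrome = 0
        for p in (1, 2, 4):
            if sum(code[i] for i in range(1, 8) if i & p) % 2:
                syndrome |= p
        overall_ok = sum(code) % 2 == 0
        if syndrome == 0 and overall_ok:
            status = "ok"
        elif not overall_ok:
            # Odd number of flipped bits: assume one bit flipped and fix it.
            # A zero syndrome here means the overall parity bit itself flipped.
            code[syndrome] ^= 1
            status = "CE"
        else:
            # Non-zero syndrome with even overall parity: two bits flipped.
            status = "UE"
        nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
        return nibble, status

Flipping any single bit of a codeword yields ``CE`` together with the
original data. Flipping two bits yields ``UE``; the data word is still
returned, as described above, but it cannot be trusted.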

The information about the CE/UE errors is stored in some special registers
at the memory controller and can be accessed by reading such registers,
either by BIOS, by some special CPUs or by the Linux EDAC driver. On x86 64
bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.

.. [#f1] Please notice that several memory controllers allow operation in a
  mode called "Lock-Step", where it groups two memory modules together,
  doing 128-bit reads/writes. That gives 16 bits for error correction, which
  significantly improves the error correction mechanism, at the expense
  that, when an error happens, there's no way to know which memory module is
  to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
  In such mode, the same data is written to two memory modules. At read,
  the system checks both memory modules, in order to check if both provide
  identical data. In such configuration, when an error happens, there's no
  way to know which memory module is to blame. So, it has to blame both
  memory modules (or 4 memory modules, if the system is also in Lock-step
  mode).

.. [#f3] For more details about the Machine Check Architecture (MCA),
  please read Documentation/arch/x86/x86_64/machinecheck.rst in the Kernel tree.

EDAC - Error Detection And Correction
*************************************

.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was pushed upstream for the first time, in
   Kernel 2.6.16, it was renamed to ``EDAC``.

Purpose
-------

The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

Memory
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the ``edac_mc`` device.

Detecting CE events, then harvesting those events and reporting them,
**can** but need not necessarily be a predictor of future UE events. With
CE events only, the system can and will continue to operate as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
modules exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

Other hardware elements
-----------------------

A new feature for EDAC, the ``edac_device`` class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory type of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.


PCI bus scanning
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity.  Some vendors do not do this, and thus the parity bit
can "float" giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device. The attribute is::

	broken_parity_status

and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
PCI devices.
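
As a sketch of how an administrator could list the devices for which parity
scanning is skipped, the function below walks a directory of PCI devices and
reports those whose ``broken_parity_status`` reads as non-zero. It assumes
that ``/sys/bus/pci/devices`` contains one entry per PCI device, linking to
the directories mentioned above, and that the attribute reads as ``0`` or
``1``; the function name is hypothetical.

.. code-block:: python

    import os


    def devices_with_broken_parity(pci_devices_dir="/sys/bus/pci/devices"):
        """Return the names of PCI devices whose broken_parity_status is set."""
        flagged = []
        if not os.path.isdir(pci_devices_dir):
            return flagged
        for name in sorted(os.listdir(pci_devices_dir)):
            path = os.path.join(pci_devices_dir, name, "broken_parity_status")
            try:
                with open(path) as attr:
                    if attr.read().strip() not in ("", "0"):
                        flagged.append(name)
            except OSError:
                # Attribute missing or unreadable: skip this device.
                continue
        return flagged

The directory is passed as a parameter so that the function can be tried
against a copy of the tree instead of a live system.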


Versioning
----------

EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
``edac_device`` driver) have individual versions that reflect the current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


Loading
-------

If ``edac`` was statically linked with the kernel then no loading
is necessary. If ``edac`` was built as modules then simply modprobe
the ``edac`` pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example::

	$ modprobe amd76x_edac

loads both the ``amd76x_edac.ko`` memory controller module and the
``edac_core.ko`` core module.
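
To check the result, the list of loaded modules can be inspected. The sketch
below filters the text of ``/proc/modules``, where the first field of each
line is the module name. Matching on ``edac_core`` and on names ending in
``_edac`` is an assumption based on the module names used in this document,
and the function name is hypothetical.

.. code-block:: python

    def loaded_edac_modules(proc_modules_text):
        """Return names of loaded modules that look like EDAC modules."""
        names = []
        for line in proc_modules_text.splitlines():
            fields = line.split()
            if fields and (fields[0] == "edac_core" or fields[0].endswith("_edac")):
                names.append(fields[0])
        return names

On a live system the input would typically come from
``open("/proc/modules").read()``. An empty result can also mean that EDAC was
statically linked with the kernel.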


Sysfs interface
---------------

EDAC presents a ``sysfs`` interface for control and reporting purposes. It
lives in the ``/sys/devices/system/edac`` directory.

Within this directory there currently reside 2 components:

	======= ==============================
	mc	memory controller(s) system
	pci	PCI control and status system
	======= ==============================



Memory Controller (mc) Model
----------------------------

Each ``mc`` device controls a set of memory modules [#f4]_. These modules
are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
There can be multiple csrows and multiple channels.

.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
  used to refer to a memory module, although there are other memory
  packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
  specification (Version 2.7) defines a memory module in the Common
  Platform Error Record (CPER) section to be an SMBIOS Memory Device
  (Type 17). Throughout this document, and inside the EDAC subsystem, the
  term "dimm" is used for all memory modules, even when they use a
  different kind of packaging.

Memory controllers allow for several csrows, with 8 csrows being a
typical value. Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and memory module characteristics.

Dual channels allow for dual data length (e.g. 128 bits, on 64 bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers. The following example will assume 2 channels:

	+------------+-----------------------+
	| CS Rows    |       Channels        |
	+------------+-----------+-----------+
	|            |  ``ch0``  |  ``ch1``  |
	+============+===========+===========+
	|            |**DIMM_A0**|**DIMM_B0**|
	+------------+-----------+-----------+
	| ``csrow0`` |   rank0   |   rank0   |
	+------------+-----------+-----------+
	| ``csrow1`` |   rank1   |   rank1   |
	+------------+-----------+-----------+
	|            |**DIMM_A1**|**DIMM_B1**|
	+------------+-----------+-----------+
	| ``csrow2`` |   rank0   |   rank0   |
	+------------+-----------+-----------+
	| ``csrow3`` |   rank1   |   rank1   |
	+------------+-----------+-----------+

In the above example, there are 4 physical slots on the motherboard
for memory DIMMs:

	+---------+---------+
	| DIMM_A0 | DIMM_B0 |
	+---------+---------+
	| DIMM_A1 | DIMM_B1 |
	+---------+---------+

Labels for these slots are usually silk-screened on the motherboard.
Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
In the example above 2 dual ranked DIMMs are similarly placed. Thus,
both csrow0 and csrow1 are populated. On the other hand, when 2 single
ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will
have just one csrow (csrow0) and csrow1 will be empty. The pattern
repeats itself for csrow2 and csrow3. Also note that some memory
controllers don't have any logic to identify the memory module, see
``rankX`` directories below.
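
For the specific layout used in this example (2 channels, 2 slots per
channel, dual ranked DIMMs), the mapping from a csrow and channel pair to a
slot label and rank can be written down directly. The function below is a
sketch that is valid only for this example table; real systems need the
mapping that matches their own motherboard, and the function name is
hypothetical.

.. code-block:: python

    def slot_for(csrow, channel):
        """Return (slot_label, rank) for the 2-channel, dual-rank example."""
        if csrow not in range(4) or channel not in range(2):
            raise ValueError("the example layout has csrow0-3 and ch0-1 only")
        # Channel 0 maps to the "A" slots and channel 1 to the "B" slots;
        # each pair of consecutive csrows belongs to one slot row.
        label = "DIMM_{}{}".format("AB"[channel], csrow // 2)
        return label, csrow % 2

For instance, ``slot_for(3, 1)`` returns ``("DIMM_B1", 1)``, matching the
last row of the table above.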

The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
``/sys/devices/system/edac/mc``, each memory controller will be
represented by its own ``mcX`` directory, where ``X`` is the
index of the MC::

	..../edac/mc/
		   |
		   |->mc0
		   |->mc1
		   |->mc2
		   ....

Under each ``mcX`` directory each ``csrowX`` is again represented by a
``csrowX`` directory, where ``X`` is the csrow index::

	.../mc/mc0/
		|
		|->csrow0
		|->csrow2
		|->csrow3
		....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.
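
The same observation can be made programmatically. The sketch below lists
the ``csrowX`` directories found under one ``mcX`` directory and reports the
indexes that are absent. The function names are hypothetical, and the
directory is passed as a parameter so that the code does not depend on a
particular system.

.. code-block:: python

    import os
    import re


    def populated_csrows(mc_dir):
        """Return the sorted csrow indexes that exist under one mcX directory."""
        indexes = []
        for name in os.listdir(mc_dir):
            match = re.fullmatch(r"csrow(\d+)", name)
            if match and os.path.isdir(os.path.join(mc_dir, name)):
                indexes.append(int(match.group(1)))
        return sorted(indexes)


    def missing_csrows(indexes):
        """Return indexes below the highest populated csrow that are absent."""
        if not indexes:
            return []
        return [i for i in range(max(indexes)) if i not in indexes]

For the ``mc0`` tree shown above, ``populated_csrows()`` would return
``[0, 2, 3]`` and ``missing_csrows()`` would return ``[1]``.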

Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.

``mcX`` directories
-------------------

In ``mcX`` directories are EDAC control and attribute files for
this ``X`` instance of the memory controllers.

For a description of the sysfs API, please see:

	Documentation/ABI/testing/sysfs-devices-edac


``dimmX`` or ``rankX`` directories
----------------------------------

The recommended way to use the EDAC subsystem is to look at the information
provided by the ``dimmX`` or ``rankX`` directories [#f5]_.

A typical EDAC system has the following structure under
``/sys/devices/system/edac/``\ [#f6]_::

	/sys/devices/system/edac/
	├── mc
	│   ├── mc0
	│   │   ├── ce_count
	│   │   ├── ce_noinfo_count
	│   │   ├── dimm0
	│   │   │   ├── dimm_ce_count
	│   │   │   ├── dimm_dev_type
	│   │   │   ├── dimm_edac_mode
	│   │   │   ├── dimm_label
	│   │   │   ├── dimm_location
	│   │   │   ├── dimm_mem_type
	│   │   │   ├── dimm_ue_count
	│   │   │   ├── size
	│   │   │   └── uevent
	│   │   ├── max_location
	│   │   ├── mc_name
	│   │   ├── reset_counters
	│   │   ├── seconds_since_reset
	│   │   ├── size_mb
	│   │   ├── ue_count
	│   │   ├── ue_noinfo_count
	│   │   └── uevent
	│   ├── mc1
	│   │   ├── ce_count
	│   │   ├── ce_noinfo_count
	│   │   ├── dimm0
	│   │   │   ├── dimm_ce_count
	│   │   │   ├── dimm_dev_type
	│   │   │   ├── dimm_edac_mode
	│   │   │   ├── dimm_label
	│   │   │   ├── dimm_location
	│   │   │   ├── dimm_mem_type
	│   │   │   ├── dimm_ue_count
	│   │   │   ├── size
	│   │   │   └── uevent
	│   │   ├── max_location
	│   │   ├── mc_name
	│   │   ├── reset_counters
	│   │   ├── seconds_since_reset
	│   │   ├── size_mb
	│   │   ├── ue_count
	│   │   ├── ue_noinfo_count
	│   │   └── uevent
	│   └── uevent
	└── uevent
490fd77f6baSMauro Carvalho Chehab
491fd77f6baSMauro Carvalho ChehabIn the ``dimmX`` directories are EDAC control and attribute files for
492fd77f6baSMauro Carvalho Chehabthis ``X`` memory module:
493fd77f6baSMauro Carvalho Chehab
- ``size`` - Total memory managed by this DIMM attribute file

	This attribute file displays, in count of megabytes, the memory
	that this DIMM contains.
498fd77f6baSMauro Carvalho Chehab
4994fb6fde7SAaron Miller- ``dimm_ue_count`` - Uncorrectable Errors count attribute file
5004fb6fde7SAaron Miller
5014fb6fde7SAaron Miller	This attribute file displays the total count of uncorrectable
5024fb6fde7SAaron Miller	errors that have occurred on this DIMM. If panic_on_ue is set
5034fb6fde7SAaron Miller	this counter will not have a chance to increment, since EDAC
5044fb6fde7SAaron Miller	will panic the system.
5054fb6fde7SAaron Miller
5064fb6fde7SAaron Miller- ``dimm_ce_count`` - Correctable Errors count attribute file
5074fb6fde7SAaron Miller
5084fb6fde7SAaron Miller	This attribute file displays the total count of correctable
5094fb6fde7SAaron Miller	errors that have occurred on this DIMM. This count is very
5104fb6fde7SAaron Miller	important to examine. CEs provide early indications that a
	DIMM is beginning to fail. This count field should be
	monitored for non-zero values, and any such values should be
	reported to the system administrator.
5144fb6fde7SAaron Miller
515fd77f6baSMauro Carvalho Chehab- ``dimm_dev_type``  - Device type attribute file
516fd77f6baSMauro Carvalho Chehab
517fd77f6baSMauro Carvalho Chehab	This attribute file will display what type of DRAM device is
518fd77f6baSMauro Carvalho Chehab	being utilized on this DIMM.
519fd77f6baSMauro Carvalho Chehab	Examples:
520fd77f6baSMauro Carvalho Chehab
521fd77f6baSMauro Carvalho Chehab		- x1
522fd77f6baSMauro Carvalho Chehab		- x2
523fd77f6baSMauro Carvalho Chehab		- x4
524fd77f6baSMauro Carvalho Chehab		- x8
525fd77f6baSMauro Carvalho Chehab
526fd77f6baSMauro Carvalho Chehab- ``dimm_edac_mode`` - EDAC Mode of operation attribute file
527fd77f6baSMauro Carvalho Chehab
528fd77f6baSMauro Carvalho Chehab	This attribute file will display what type of Error detection
529fd77f6baSMauro Carvalho Chehab	and correction is being utilized.
530fd77f6baSMauro Carvalho Chehab
531fd77f6baSMauro Carvalho Chehab- ``dimm_label`` - memory module label control file
532fd77f6baSMauro Carvalho Chehab
533fd77f6baSMauro Carvalho Chehab	This control file allows this DIMM to have a label assigned
534fd77f6baSMauro Carvalho Chehab	to it. With this label in the module, when errors occur
535fd77f6baSMauro Carvalho Chehab	the output can provide the DIMM label in the system log.
536fd77f6baSMauro Carvalho Chehab	This becomes vital for panic events to isolate the
537fd77f6baSMauro Carvalho Chehab	cause of the UE event.
538fd77f6baSMauro Carvalho Chehab
539fd77f6baSMauro Carvalho Chehab	DIMM Labels must be assigned after booting, with information
540fd77f6baSMauro Carvalho Chehab	that correctly identifies the physical slot with its
541fd77f6baSMauro Carvalho Chehab	silk screen label. This information is currently very
542fd77f6baSMauro Carvalho Chehab	motherboard specific and determination of this information
543fd77f6baSMauro Carvalho Chehab	must occur in userland at this time.
544fd77f6baSMauro Carvalho Chehab
545fd77f6baSMauro Carvalho Chehab- ``dimm_location`` - location of the memory module
546fd77f6baSMauro Carvalho Chehab
547fd77f6baSMauro Carvalho Chehab	The location can have up to 3 levels, and describe how the
548fd77f6baSMauro Carvalho Chehab	memory controller identifies the location of a memory module.
549fd77f6baSMauro Carvalho Chehab	Depending on the type of memory and memory controller, it
550fd77f6baSMauro Carvalho Chehab	can be:
551fd77f6baSMauro Carvalho Chehab
		- *csrow* and *channel* - used when the memory controller
		  doesn't identify a single DIMM - e.g. in ``rankX`` dir;
554fd77f6baSMauro Carvalho Chehab		- *branch*, *channel*, *slot* - typically used on FB-DIMM memory
555fd77f6baSMauro Carvalho Chehab		  controllers;
556fd77f6baSMauro Carvalho Chehab		- *channel*, *slot* - used on Nehalem and newer Intel drivers.
557fd77f6baSMauro Carvalho Chehab
558fd77f6baSMauro Carvalho Chehab- ``dimm_mem_type`` - Memory Type attribute file
559fd77f6baSMauro Carvalho Chehab
	This attribute file will display what type of memory is currently
	on this DIMM. Normally, either buffered or unbuffered memory.
562fd77f6baSMauro Carvalho Chehab	Examples:
563fd77f6baSMauro Carvalho Chehab
564fd77f6baSMauro Carvalho Chehab		- Registered-DDR
565fd77f6baSMauro Carvalho Chehab		- Unbuffered-DDR
566fd77f6baSMauro Carvalho Chehab
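The per-DIMM attribute files listed above lend themselves to simple
scripted monitoring. The following is a minimal sketch, not part of any
EDAC tooling: the function name ``read_dimm_counters`` and its ``root``
parameter are assumptions introduced for this example, and only the
attribute files documented above are read.

```python
from pathlib import Path


def _read(path):
    """Return the stripped content of a sysfs file, or None if absent."""
    return path.read_text().strip() if path.is_file() else None


def read_dimm_counters(root="/sys/devices/system/edac/mc"):
    """Return one dict per dimmX/rankX directory found under root.

    Hypothetical helper: it only reads the files described in this
    section and never writes to sysfs.
    """
    results = []
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        modules = list(mc.glob("dimm[0-9]*")) + list(mc.glob("rank[0-9]*"))
        for module in sorted(modules):
            ce = _read(module / "dimm_ce_count")
            ue = _read(module / "dimm_ue_count")
            results.append({
                "mc": mc.name,
                "dimm": module.name,
                "label": _read(module / "dimm_label"),
                "location": _read(module / "dimm_location"),
                "ce_count": int(ce) if ce is not None else None,
                "ue_count": int(ue) if ue is not None else None,
            })
    return results
```

A monitoring job could call ``read_dimm_counters()`` periodically and
report every entry whose ``ce_count`` or ``ue_count`` is non-zero.
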
.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is
  called ``rankX`` and works in a similar way to the ``csrowX``
  directories. On modern Intel memory controllers, the memory controller
  identifies the memory modules directly. On such systems, the directory
  is called ``dimmX``.
571fd77f6baSMauro Carvalho Chehab
572fd77f6baSMauro Carvalho Chehab.. [#f6] There are also some ``power`` directories and ``subsystem``
573fd77f6baSMauro Carvalho Chehab  symlinks inside the sysfs mapping that are automatically created by
574fd77f6baSMauro Carvalho Chehab  the sysfs subsystem. Currently, they serve no purpose.
575fd77f6baSMauro Carvalho Chehab
576fd77f6baSMauro Carvalho Chehab``csrowX`` directories
577fd77f6baSMauro Carvalho Chehab----------------------
578fd77f6baSMauro Carvalho Chehab
579fd77f6baSMauro Carvalho ChehabWhen CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
580fd77f6baSMauro Carvalho Chehabdirectories. As this API doesn't work properly for Rambus, FB-DIMMs and
581fd77f6baSMauro Carvalho Chehabmodern Intel Memory Controllers, this is being deprecated in favor of
582fd77f6baSMauro Carvalho Chehab``dimmX`` directories.
583fd77f6baSMauro Carvalho Chehab
584fd77f6baSMauro Carvalho ChehabIn the ``csrowX`` directories are EDAC control and attribute files for
585fd77f6baSMauro Carvalho Chehabthis ``X`` instance of csrow:
586fd77f6baSMauro Carvalho Chehab
587fd77f6baSMauro Carvalho Chehab
588fd77f6baSMauro Carvalho Chehab- ``ue_count`` - Total Uncorrectable Errors count attribute file
589fd77f6baSMauro Carvalho Chehab
590fd77f6baSMauro Carvalho Chehab	This attribute file displays the total count of uncorrectable
591fd77f6baSMauro Carvalho Chehab	errors that have occurred on this csrow. If panic_on_ue is set
592fd77f6baSMauro Carvalho Chehab	this counter will not have a chance to increment, since EDAC
593fd77f6baSMauro Carvalho Chehab	will panic the system.
594fd77f6baSMauro Carvalho Chehab
595fd77f6baSMauro Carvalho Chehab
596fd77f6baSMauro Carvalho Chehab- ``ce_count`` - Total Correctable Errors count attribute file
597fd77f6baSMauro Carvalho Chehab
598fd77f6baSMauro Carvalho Chehab	This attribute file displays the total count of correctable
599fd77f6baSMauro Carvalho Chehab	errors that have occurred on this csrow. This count is very
600fd77f6baSMauro Carvalho Chehab	important to examine. CEs provide early indications that a
	DIMM is beginning to fail. This count field should be
	monitored for non-zero values, and any such values should be
	reported to the system administrator.
604fd77f6baSMauro Carvalho Chehab
605fd77f6baSMauro Carvalho Chehab
606fd77f6baSMauro Carvalho Chehab- ``size_mb`` - Total memory managed by this csrow attribute file
607fd77f6baSMauro Carvalho Chehab
608fd77f6baSMauro Carvalho Chehab	This attribute file displays, in count of megabytes, the memory
609fd77f6baSMauro Carvalho Chehab	that this csrow contains.
610fd77f6baSMauro Carvalho Chehab
611fd77f6baSMauro Carvalho Chehab
612fd77f6baSMauro Carvalho Chehab- ``mem_type`` - Memory Type attribute file
613fd77f6baSMauro Carvalho Chehab
614fd77f6baSMauro Carvalho Chehab	This attribute file will display what type of memory is currently
615fd77f6baSMauro Carvalho Chehab	on this csrow. Normally, either buffered or unbuffered memory.
616fd77f6baSMauro Carvalho Chehab	Examples:
617fd77f6baSMauro Carvalho Chehab
618fd77f6baSMauro Carvalho Chehab		- Registered-DDR
619fd77f6baSMauro Carvalho Chehab		- Unbuffered-DDR
620fd77f6baSMauro Carvalho Chehab
621fd77f6baSMauro Carvalho Chehab
622fd77f6baSMauro Carvalho Chehab- ``edac_mode`` - EDAC Mode of operation attribute file
623fd77f6baSMauro Carvalho Chehab
624fd77f6baSMauro Carvalho Chehab	This attribute file will display what type of Error detection
625fd77f6baSMauro Carvalho Chehab	and correction is being utilized.
626fd77f6baSMauro Carvalho Chehab
627fd77f6baSMauro Carvalho Chehab
628fd77f6baSMauro Carvalho Chehab- ``dev_type`` - Device type attribute file
629fd77f6baSMauro Carvalho Chehab
630fd77f6baSMauro Carvalho Chehab	This attribute file will display what type of DRAM device is
631fd77f6baSMauro Carvalho Chehab	being utilized on this DIMM.
632fd77f6baSMauro Carvalho Chehab	Examples:
633fd77f6baSMauro Carvalho Chehab
634fd77f6baSMauro Carvalho Chehab		- x1
635fd77f6baSMauro Carvalho Chehab		- x2
636fd77f6baSMauro Carvalho Chehab		- x4
637fd77f6baSMauro Carvalho Chehab		- x8
638fd77f6baSMauro Carvalho Chehab
639fd77f6baSMauro Carvalho Chehab
640fd77f6baSMauro Carvalho Chehab- ``ch0_ce_count`` - Channel 0 CE Count attribute file
641fd77f6baSMauro Carvalho Chehab
642fd77f6baSMauro Carvalho Chehab	This attribute file will display the count of CEs on this
643fd77f6baSMauro Carvalho Chehab	DIMM located in channel 0.
644fd77f6baSMauro Carvalho Chehab
645fd77f6baSMauro Carvalho Chehab
646fd77f6baSMauro Carvalho Chehab- ``ch0_ue_count`` - Channel 0 UE Count attribute file
647fd77f6baSMauro Carvalho Chehab
648fd77f6baSMauro Carvalho Chehab	This attribute file will display the count of UEs on this
649fd77f6baSMauro Carvalho Chehab	DIMM located in channel 0.
650fd77f6baSMauro Carvalho Chehab
651fd77f6baSMauro Carvalho Chehab
652fd77f6baSMauro Carvalho Chehab- ``ch0_dimm_label`` - Channel 0 DIMM Label control file
653fd77f6baSMauro Carvalho Chehab
654fd77f6baSMauro Carvalho Chehab
655fd77f6baSMauro Carvalho Chehab	This control file allows this DIMM to have a label assigned
656fd77f6baSMauro Carvalho Chehab	to it. With this label in the module, when errors occur
657fd77f6baSMauro Carvalho Chehab	the output can provide the DIMM label in the system log.
658fd77f6baSMauro Carvalho Chehab	This becomes vital for panic events to isolate the
659fd77f6baSMauro Carvalho Chehab	cause of the UE event.
660fd77f6baSMauro Carvalho Chehab
661fd77f6baSMauro Carvalho Chehab	DIMM Labels must be assigned after booting, with information
662fd77f6baSMauro Carvalho Chehab	that correctly identifies the physical slot with its
663fd77f6baSMauro Carvalho Chehab	silk screen label. This information is currently very
664fd77f6baSMauro Carvalho Chehab	motherboard specific and determination of this information
665fd77f6baSMauro Carvalho Chehab	must occur in userland at this time.
666fd77f6baSMauro Carvalho Chehab
667fd77f6baSMauro Carvalho Chehab
668fd77f6baSMauro Carvalho Chehab- ``ch1_ce_count`` - Channel 1 CE Count attribute file
669fd77f6baSMauro Carvalho Chehab
670fd77f6baSMauro Carvalho Chehab
671fd77f6baSMauro Carvalho Chehab	This attribute file will display the count of CEs on this
672fd77f6baSMauro Carvalho Chehab	DIMM located in channel 1.
673fd77f6baSMauro Carvalho Chehab
674fd77f6baSMauro Carvalho Chehab
675fd77f6baSMauro Carvalho Chehab- ``ch1_ue_count`` - Channel 1 UE Count attribute file
676fd77f6baSMauro Carvalho Chehab
677fd77f6baSMauro Carvalho Chehab
	This attribute file will display the count of UEs on this
	DIMM located in channel 1.
680fd77f6baSMauro Carvalho Chehab
681fd77f6baSMauro Carvalho Chehab
682fd77f6baSMauro Carvalho Chehab- ``ch1_dimm_label`` - Channel 1 DIMM Label control file
683fd77f6baSMauro Carvalho Chehab
684fd77f6baSMauro Carvalho Chehab	This control file allows this DIMM to have a label assigned
685fd77f6baSMauro Carvalho Chehab	to it. With this label in the module, when errors occur
686fd77f6baSMauro Carvalho Chehab	the output can provide the DIMM label in the system log.
687fd77f6baSMauro Carvalho Chehab	This becomes vital for panic events to isolate the
688fd77f6baSMauro Carvalho Chehab	cause of the UE event.
689fd77f6baSMauro Carvalho Chehab
690fd77f6baSMauro Carvalho Chehab	DIMM Labels must be assigned after booting, with information
691fd77f6baSMauro Carvalho Chehab	that correctly identifies the physical slot with its
692fd77f6baSMauro Carvalho Chehab	silk screen label. This information is currently very
693fd77f6baSMauro Carvalho Chehab	motherboard specific and determination of this information
694fd77f6baSMauro Carvalho Chehab	must occur in userland at this time.
695fd77f6baSMauro Carvalho Chehab
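Since DIMM labels have to be assigned from userland after booting, a
startup script typically writes them to the ``chX_dimm_label`` control
files. The sketch below illustrates that step for the legacy ``csrowX``
layout. The helper name ``set_csrow_labels``, its ``mc_dir`` parameter
and the example mapping are all hypothetical; the real mapping from
(csrow, channel) to a silk screen label is motherboard specific.

```python
from pathlib import Path


def set_csrow_labels(labels, mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Write labels to chN_dimm_label control files.

    labels maps (csrow, channel) -> label string, for example
    {(0, 0): "DIMM_A1"} (an invented mapping). Entries whose control
    file does not exist are skipped. Returns the paths written.
    """
    written = []
    for (csrow, channel), label in sorted(labels.items()):
        ctl = Path(mc_dir) / f"csrow{csrow}" / f"ch{channel}_dimm_label"
        if not ctl.exists():
            continue
        # Same effect as: echo "<label>" > .../chN_dimm_label
        ctl.write_text(label + "\n")
        written.append(str(ctl))
    return written
```

Writing to the real sysfs files normally requires root privileges.
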
696fd77f6baSMauro Carvalho Chehab
697fd77f6baSMauro Carvalho ChehabSystem Logging
698fd77f6baSMauro Carvalho Chehab--------------
699fd77f6baSMauro Carvalho Chehab
700fd77f6baSMauro Carvalho ChehabIf logging for UEs and CEs is enabled, then system logs will contain
701fd77f6baSMauro Carvalho Chehabinformation indicating that errors have been detected::
702fd77f6baSMauro Carvalho Chehab
703fd77f6baSMauro Carvalho Chehab  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
704fd77f6baSMauro Carvalho Chehab  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac
705fd77f6baSMauro Carvalho Chehab
706fd77f6baSMauro Carvalho Chehab
707fd77f6baSMauro Carvalho ChehabThe structure of the message is:
708fd77f6baSMauro Carvalho Chehab
	+---------------------------------------+-------------+
	| Content                               | Example     |
	+=======================================+=============+
	| The memory controller                 | MC0         |
	+---------------------------------------+-------------+
	| Error type                            | CE          |
	+---------------------------------------+-------------+
	| Memory page                           | 0x283       |
	+---------------------------------------+-------------+
	| Offset in the page                    | 0xce0       |
	+---------------------------------------+-------------+
	| The byte granularity                  | grain 8     |
	| or resolution of the error            |             |
	+---------------------------------------+-------------+
	| The error syndrome                    | 0x6ec3      |
	+---------------------------------------+-------------+
	| Memory row                            | row 0       |
	+---------------------------------------+-------------+
	| Memory channel                        | channel 1   |
	+---------------------------------------+-------------+
	| DIMM label, if set prior              | DIMM_B1     |
	+---------------------------------------+-------------+
	| And then an optional, driver-specific |             |
	| message that may have additional      |             |
	| information.                          |             |
	+---------------------------------------+-------------+
735fd77f6baSMauro Carvalho Chehab
736fd77f6baSMauro Carvalho ChehabBoth UEs and CEs with no info will lack all but memory controller, error
737fd77f6baSMauro Carvalho Chehabtype, a notice of "no info" and then an optional, driver-specific error
738fd77f6baSMauro Carvalho Chehabmessage.
739fd77f6baSMauro Carvalho Chehab
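The message structure described above can be split into its fields with
a regular expression. This is a minimal sketch that matches only the
layout of the sample lines shown in this section; other drivers or
kernel versions may emit a different format. The names ``EDAC_LINE``
and ``parse_edac_line`` are invented for this example.

```python
import re

# Pattern derived from the sample log lines in this section only.
EDAC_LINE = re.compile(
    r"EDAC (?P<mc>MC\d+): (?P<type>CE|UE) "
    r"page (?P<page>0x[0-9a-fA-F]+), offset (?P<offset>0x[0-9a-fA-F]+), "
    r"grain (?P<grain>\d+), syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+) "
    r'"(?P<label>[^"]*)": (?P<driver_msg>.*)'
)


def parse_edac_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = EDAC_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("page", "offset", "syndrome"):
        fields[key] = int(fields[key], 16)   # hexadecimal fields
    for key in ("grain", "row", "channel"):
        fields[key] = int(fields[key])       # decimal fields
    return fields
```

Lines reporting errors with "no info" do not carry these fields, so
``parse_edac_line`` returns ``None`` for them.
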
740fd77f6baSMauro Carvalho Chehab
741fd77f6baSMauro Carvalho ChehabPCI Bus Parity Detection
742fd77f6baSMauro Carvalho Chehab------------------------
743fd77f6baSMauro Carvalho Chehab
744fd77f6baSMauro Carvalho ChehabOn Header Type 00 devices, the primary status is looked at for any
745fd77f6baSMauro Carvalho Chehabparity error regardless of whether parity is enabled on the device or
746fd77f6baSMauro Carvalho Chehabnot. (The spec indicates parity is generated in some cases). On Header
Type 01 bridges, the secondary status register is also looked at to see
if a parity error occurred on the bus on the other side of the bridge.
749fd77f6baSMauro Carvalho Chehab
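The check described above can be illustrated on a raw copy of a
device's PCI configuration space. This is only a sketch of the idea,
not the kernel's implementation: the function name ``pci_parity_flags``
is invented, and the register offsets and bit masks are the standard
PCI configuration header values as the author of this example
understands them.

```python
import struct

PCI_STATUS = 0x06                     # primary status register, 16 bits
PCI_HEADER_TYPE = 0x0E                # header type, 8 bits
PCI_SEC_STATUS = 0x1E                 # secondary status, Type 01 only
PCI_STATUS_PARITY = 0x0100            # master data parity error
PCI_STATUS_DETECTED_PARITY = 0x8000   # detected parity error
PARITY_BITS = PCI_STATUS_PARITY | PCI_STATUS_DETECTED_PARITY


def pci_parity_flags(config):
    """Report parity error bits found in a config space dump.

    config holds at least the first 32 bytes of configuration space.
    The "secondary" key is present only for Header Type 01 bridges.
    """
    if len(config) < 0x20:
        raise ValueError("need at least 32 bytes of configuration space")
    header_type = config[PCI_HEADER_TYPE] & 0x7F  # drop multi-function bit
    (status,) = struct.unpack_from("<H", config, PCI_STATUS)
    flags = {"primary": bool(status & PARITY_BITS)}
    if header_type == 0x01:
        (sec_status,) = struct.unpack_from("<H", config, PCI_SEC_STATUS)
        flags["secondary"] = bool(sec_status & PARITY_BITS)
    return flags
```

The primary status is examined for every device, while the secondary
status is consulted only when the header type identifies a bridge.
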
750fd77f6baSMauro Carvalho Chehab
751fd77f6baSMauro Carvalho ChehabSysfs configuration
752fd77f6baSMauro Carvalho Chehab-------------------
753fd77f6baSMauro Carvalho Chehab
754fd77f6baSMauro Carvalho ChehabUnder ``/sys/devices/system/edac/pci`` are control and attribute files as
755fd77f6baSMauro Carvalho Chehabfollows:
756fd77f6baSMauro Carvalho Chehab
757fd77f6baSMauro Carvalho Chehab
758fd77f6baSMauro Carvalho Chehab- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
759fd77f6baSMauro Carvalho Chehab
760fd77f6baSMauro Carvalho Chehab	This control file enables or disables the PCI Bus Parity scanning
761fd77f6baSMauro Carvalho Chehab	operation. Writing a 1 to this file enables the scanning. Writing
762fd77f6baSMauro Carvalho Chehab	a 0 to this file disables the scanning.
763fd77f6baSMauro Carvalho Chehab
764fd77f6baSMauro Carvalho Chehab	Enable::
765fd77f6baSMauro Carvalho Chehab
766fd77f6baSMauro Carvalho Chehab		echo "1" >/sys/devices/system/edac/pci/check_pci_parity
767fd77f6baSMauro Carvalho Chehab
768fd77f6baSMauro Carvalho Chehab	Disable::
769fd77f6baSMauro Carvalho Chehab
770fd77f6baSMauro Carvalho Chehab		echo "0" >/sys/devices/system/edac/pci/check_pci_parity
771fd77f6baSMauro Carvalho Chehab
772fd77f6baSMauro Carvalho Chehab
773fd77f6baSMauro Carvalho Chehab- ``pci_parity_count`` - Parity Count
774fd77f6baSMauro Carvalho Chehab
775fd77f6baSMauro Carvalho Chehab	This attribute file will display the number of parity errors that
776fd77f6baSMauro Carvalho Chehab	have been detected.
777fd77f6baSMauro Carvalho Chehab
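The two files above can also be driven from a script. The following is
a minimal sketch under the assumption that both files hold a plain
decimal integer; ``read_pci_parity``, ``set_pci_parity_check`` and the
``root`` parameter are names made up for this example.

```python
from pathlib import Path

PCI_ROOT = "/sys/devices/system/edac/pci"


def read_pci_parity(root=PCI_ROOT):
    """Return the scanning state and the parity error count."""
    base = Path(root)
    enabled = int((base / "check_pci_parity").read_text().strip()) != 0
    count = int((base / "pci_parity_count").read_text().strip())
    return {"checking_enabled": enabled, "parity_count": count}


def set_pci_parity_check(enabled, root=PCI_ROOT):
    """Enable (write 1) or disable (write 0) PCI parity scanning."""
    control = Path(root) / "check_pci_parity"
    control.write_text("1\n" if enabled else "0\n")
```

``set_pci_parity_check(True)`` is the scripted equivalent of the
``echo "1"`` command shown above and likewise needs root privileges.
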
778fd77f6baSMauro Carvalho Chehab
779fd77f6baSMauro Carvalho ChehabModule parameters
780fd77f6baSMauro Carvalho Chehab-----------------
781fd77f6baSMauro Carvalho Chehab
782fd77f6baSMauro Carvalho Chehab- ``edac_mc_panic_on_ue`` - Panic on UE control file
783fd77f6baSMauro Carvalho Chehab
784fd77f6baSMauro Carvalho Chehab	An uncorrectable error will cause a machine panic.  This is usually
785fd77f6baSMauro Carvalho Chehab	desirable.  It is a bad idea to continue when an uncorrectable error
786fd77f6baSMauro Carvalho Chehab	occurs - it is indeterminate what was uncorrected and the operating
787fd77f6baSMauro Carvalho Chehab	system context might be so mangled that continuing will lead to further
788fd77f6baSMauro Carvalho Chehab	corruption. If the kernel has MCE configured, then EDAC will never
789fd77f6baSMauro Carvalho Chehab	notice the UE.
790fd77f6baSMauro Carvalho Chehab
791fd77f6baSMauro Carvalho Chehab	LOAD TIME::
792fd77f6baSMauro Carvalho Chehab
793fd77f6baSMauro Carvalho Chehab		module/kernel parameter: edac_mc_panic_on_ue=[0|1]
794fd77f6baSMauro Carvalho Chehab
795fd77f6baSMauro Carvalho Chehab	RUN TIME::
796fd77f6baSMauro Carvalho Chehab
797fd77f6baSMauro Carvalho Chehab		echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
798fd77f6baSMauro Carvalho Chehab
799fd77f6baSMauro Carvalho Chehab
800fd77f6baSMauro Carvalho Chehab- ``edac_mc_log_ue`` - Log UE control file
801fd77f6baSMauro Carvalho Chehab
802fd77f6baSMauro Carvalho Chehab
803fd77f6baSMauro Carvalho Chehab	Generate kernel messages describing uncorrectable errors.  These errors
804fd77f6baSMauro Carvalho Chehab	are reported through the system message log system.  UE statistics
805fd77f6baSMauro Carvalho Chehab	will be accumulated even when UE logging is disabled.
806fd77f6baSMauro Carvalho Chehab
807fd77f6baSMauro Carvalho Chehab	LOAD TIME::
808fd77f6baSMauro Carvalho Chehab
809fd77f6baSMauro Carvalho Chehab		module/kernel parameter: edac_mc_log_ue=[0|1]
810fd77f6baSMauro Carvalho Chehab
811fd77f6baSMauro Carvalho Chehab	RUN TIME::
812fd77f6baSMauro Carvalho Chehab
813fd77f6baSMauro Carvalho Chehab		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
814fd77f6baSMauro Carvalho Chehab
815fd77f6baSMauro Carvalho Chehab
816fd77f6baSMauro Carvalho Chehab- ``edac_mc_log_ce`` - Log CE control file
817fd77f6baSMauro Carvalho Chehab
818fd77f6baSMauro Carvalho Chehab
819fd77f6baSMauro Carvalho Chehab	Generate kernel messages describing correctable errors.  These
820fd77f6baSMauro Carvalho Chehab	errors are reported through the system message log system.
821fd77f6baSMauro Carvalho Chehab	CE statistics will be accumulated even when CE logging is disabled.
822fd77f6baSMauro Carvalho Chehab
823fd77f6baSMauro Carvalho Chehab	LOAD TIME::
824fd77f6baSMauro Carvalho Chehab
825fd77f6baSMauro Carvalho Chehab		module/kernel parameter: edac_mc_log_ce=[0|1]
826fd77f6baSMauro Carvalho Chehab
827fd77f6baSMauro Carvalho Chehab	RUN TIME::
828fd77f6baSMauro Carvalho Chehab
829fd77f6baSMauro Carvalho Chehab		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
830fd77f6baSMauro Carvalho Chehab
831fd77f6baSMauro Carvalho Chehab
832fd77f6baSMauro Carvalho Chehab- ``edac_mc_poll_msec`` - Polling period control file
833fd77f6baSMauro Carvalho Chehab
834fd77f6baSMauro Carvalho Chehab
835fd77f6baSMauro Carvalho Chehab	The time period, in milliseconds, for polling for error information.
836fd77f6baSMauro Carvalho Chehab	Too small a value wastes resources.  Too large a value might delay
	necessary handling of errors and might lose valuable information for
	locating the error.  1000 milliseconds (once each second) is the current
	default. Systems that require all the bandwidth they can get may
	increase this.
841fd77f6baSMauro Carvalho Chehab
842fd77f6baSMauro Carvalho Chehab	LOAD TIME::
843fd77f6baSMauro Carvalho Chehab
		module/kernel parameter: edac_mc_poll_msec=<milliseconds>
845fd77f6baSMauro Carvalho Chehab
846fd77f6baSMauro Carvalho Chehab	RUN TIME::
847fd77f6baSMauro Carvalho Chehab
848fd77f6baSMauro Carvalho Chehab		echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
849fd77f6baSMauro Carvalho Chehab
850fd77f6baSMauro Carvalho Chehab
851fd77f6baSMauro Carvalho Chehab- ``panic_on_pci_parity`` - Panic on PCI PARITY Error
852fd77f6baSMauro Carvalho Chehab
853fd77f6baSMauro Carvalho Chehab
854fd77f6baSMauro Carvalho Chehab	This control file enables or disables panicking when a parity
855fd77f6baSMauro Carvalho Chehab	error has been detected.
856fd77f6baSMauro Carvalho Chehab
857fd77f6baSMauro Carvalho Chehab
858fd77f6baSMauro Carvalho Chehab	module/kernel parameter::
859fd77f6baSMauro Carvalho Chehab
860fd77f6baSMauro Carvalho Chehab			edac_panic_on_pci_pe=[0|1]
861fd77f6baSMauro Carvalho Chehab
862fd77f6baSMauro Carvalho Chehab	Enable::
863fd77f6baSMauro Carvalho Chehab
864fd77f6baSMauro Carvalho Chehab		echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
865fd77f6baSMauro Carvalho Chehab
866fd77f6baSMauro Carvalho Chehab	Disable::
867fd77f6baSMauro Carvalho Chehab
868fd77f6baSMauro Carvalho Chehab		echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
869fd77f6baSMauro Carvalho Chehab
870fd77f6baSMauro Carvalho Chehab
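The run-time settings above all follow the same pattern of writing a
value to a file under ``/sys/module/edac_core/parameters``. The sketch
below wraps that pattern with basic validation. The helper name
``set_edac_param`` and its ``root`` parameter are invented for this
example, and the validation rules are assumptions of the sketch; the
kernel may apply stricter limits, in particular on the polling period.

```python
from pathlib import Path

PARAM_ROOT = "/sys/module/edac_core/parameters"

# Parameters documented above as taking 0 or 1.
BOOLEAN_PARAMS = {
    "edac_mc_panic_on_ue",
    "edac_mc_log_ue",
    "edac_mc_log_ce",
    "edac_panic_on_pci_pe",
}


def set_edac_param(name, value, root=PARAM_ROOT):
    """Validate and write one of the parameters described above."""
    value = int(value)
    if name in BOOLEAN_PARAMS:
        if value not in (0, 1):
            raise ValueError(f"{name} expects 0 or 1")
    elif name == "edac_mc_poll_msec":
        if value <= 0:
            raise ValueError("polling period must be a positive integer")
    else:
        raise KeyError(name)
    # Same effect as: echo "<value>" > <root>/<name>
    (Path(root) / name).write_text(f"{value}\n")
    return value
```

For instance, ``set_edac_param("edac_mc_poll_msec", 1000)`` mirrors the
``echo "1000"`` command shown above.
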
871fd77f6baSMauro Carvalho Chehab
872fd77f6baSMauro Carvalho ChehabEDAC device type
873fd77f6baSMauro Carvalho Chehab----------------
874fd77f6baSMauro Carvalho Chehab
87566c222a0SMauro Carvalho ChehabIn the header file, edac_pci.h, there is a series of edac_device structures
876fd77f6baSMauro Carvalho Chehaband APIs for the EDAC_DEVICE.
877fd77f6baSMauro Carvalho Chehab
878fd77f6baSMauro Carvalho ChehabUser space access to an edac_device is through the sysfs interface.
879fd77f6baSMauro Carvalho Chehab
880fd77f6baSMauro Carvalho ChehabAt the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
881fd77f6baSMauro Carvalho Chehabwill appear.
882fd77f6baSMauro Carvalho Chehab
883fd77f6baSMauro Carvalho ChehabThere is a three level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
885fd77f6baSMauro Carvalho Chehabwebsite) installs itself as::
886fd77f6baSMauro Carvalho Chehab
887fd77f6baSMauro Carvalho Chehab	/sys/devices/system/edac/test-instance
888fd77f6baSMauro Carvalho Chehab
In this directory are various controls, a symlink and one or more ``instance``
directories.
891fd77f6baSMauro Carvalho Chehab
892fd77f6baSMauro Carvalho ChehabThe standard default controls are:
893fd77f6baSMauro Carvalho Chehab
894fd77f6baSMauro Carvalho Chehab	==============	=======================================================
895fd77f6baSMauro Carvalho Chehab	log_ce		boolean to log CE events
896fd77f6baSMauro Carvalho Chehab	log_ue		boolean to log UE events
897fd77f6baSMauro Carvalho Chehab	panic_on_ue	boolean to ``panic`` the system if an UE is encountered
898fd77f6baSMauro Carvalho Chehab			(default off, can be set true via startup script)
899fd77f6baSMauro Carvalho Chehab	poll_msec	time period between POLL cycles for events
900fd77f6baSMauro Carvalho Chehab	==============	=======================================================
901fd77f6baSMauro Carvalho Chehab
The test_device_edac device adds at least one custom control of its own:
903fd77f6baSMauro Carvalho Chehab
904fd77f6baSMauro Carvalho Chehab	==============	==================================================
905fd77f6baSMauro Carvalho Chehab	test_bits	which in the current test driver does nothing but
906fd77f6baSMauro Carvalho Chehab			show how it is installed. A ported driver can
907fd77f6baSMauro Carvalho Chehab			add one or more such controls and/or attributes
908fd77f6baSMauro Carvalho Chehab			for specific uses.
909fd77f6baSMauro Carvalho Chehab			One out-of-tree driver uses controls here to allow
910fd77f6baSMauro Carvalho Chehab			for ERROR INJECTION operations to hardware
911fd77f6baSMauro Carvalho Chehab			injection registers
912fd77f6baSMauro Carvalho Chehab	==============	==================================================
913fd77f6baSMauro Carvalho Chehab
The symlink points to the ``struct device`` that is registered for this
edac_device.
915fd77f6baSMauro Carvalho Chehab
916fd77f6baSMauro Carvalho ChehabInstances
917fd77f6baSMauro Carvalho Chehab---------
918fd77f6baSMauro Carvalho Chehab
919fd77f6baSMauro Carvalho ChehabOne or more instance directories are present. For the ``test_device_edac``
920fd77f6baSMauro Carvalho Chehabcase:
921fd77f6baSMauro Carvalho Chehab
922fd77f6baSMauro Carvalho Chehab	+----------------+
923fd77f6baSMauro Carvalho Chehab	| test-instance0 |
924fd77f6baSMauro Carvalho Chehab	+----------------+
925fd77f6baSMauro Carvalho Chehab
926fd77f6baSMauro Carvalho Chehab
In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.
929fd77f6baSMauro Carvalho Chehab
930fd77f6baSMauro Carvalho Chehab	==============	====================================
931fd77f6baSMauro Carvalho Chehab	ce_count	total of CE events of subdirectories
932fd77f6baSMauro Carvalho Chehab	ue_count	total of UE events of subdirectories
933fd77f6baSMauro Carvalho Chehab	==============	====================================
934fd77f6baSMauro Carvalho Chehab
935fd77f6baSMauro Carvalho ChehabBlocks
936fd77f6baSMauro Carvalho Chehab------
937fd77f6baSMauro Carvalho Chehab
938fd77f6baSMauro Carvalho ChehabAt the lowest directory level is the ``block`` directory. There can be 0, 1
939fd77f6baSMauro Carvalho Chehabor more blocks specified in each instance:
940fd77f6baSMauro Carvalho Chehab
941fd77f6baSMauro Carvalho Chehab	+-------------+
942fd77f6baSMauro Carvalho Chehab	| test-block0 |
943fd77f6baSMauro Carvalho Chehab	+-------------+
944fd77f6baSMauro Carvalho Chehab
945fd77f6baSMauro Carvalho ChehabIn this directory the default attributes are:
946fd77f6baSMauro Carvalho Chehab
947fd77f6baSMauro Carvalho Chehab	==============	================================================
948fd77f6baSMauro Carvalho Chehab	ce_count	which is counter of CE events for this ``block``
949fd77f6baSMauro Carvalho Chehab			of hardware being monitored
950fd77f6baSMauro Carvalho Chehab	ue_count	which is counter of UE events for this ``block``
951fd77f6baSMauro Carvalho Chehab			of hardware being monitored
952fd77f6baSMauro Carvalho Chehab	==============	================================================
953fd77f6baSMauro Carvalho Chehab
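Because the instance-level counters are totals of the block-level
counters, the relationship can be checked by walking the tree. The
sketch below sums ``ce_count`` and ``ue_count`` over the block
directories of each instance. The function name ``sum_block_counters``
is invented, and treating every subdirectory that contains a
``ce_count`` file as an instance or block is an assumption of this
example.

```python
from pathlib import Path


def _read_int(path):
    return int(path.read_text().strip())


def sum_block_counters(device_dir):
    """Sum ce_count/ue_count over the blocks of each instance.

    device_dir is an edac_device directory, for example
    /sys/devices/system/edac/test-instance.
    """
    totals = {}
    for inst in sorted(Path(device_dir).iterdir()):
        if not (inst / "ce_count").is_file():
            continue  # control file, symlink or other non-instance entry
        ce = ue = 0
        for block in sorted(inst.iterdir()):
            if (block / "ce_count").is_file():
                ce += _read_int(block / "ce_count")
                ue += _read_int(block / "ue_count")
        totals[inst.name] = {"ce_count": ce, "ue_count": ue}
    return totals
```

Comparing the returned sums with the instance's own ``ce_count`` and
``ue_count`` files shows whether the totals agree.
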
954fd77f6baSMauro Carvalho Chehab
The ``test_device_edac`` device adds 4 attributes and 1 control:

	================== ====================================================
	test-block-bits-0	for every POLL cycle this counter
				is incremented
	test-block-bits-1	every 10 cycles, this counter is bumped once,
				and test-block-bits-0 is set to 0
	test-block-bits-2	every 100 cycles, this counter is bumped once,
				and test-block-bits-1 is set to 0
	test-block-bits-3	every 1000 cycles, this counter is bumped once,
				and test-block-bits-2 is set to 0
	================== ====================================================


	================== ====================================================
	reset-counters		writing anything to this control
				resets all of the above counters.
	================== ====================================================


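Read together, the four counters behave like the digits of a decimal odometer.
The sketch below models that reading and adds a small wrapper for the reset
control. It assumes that each counter is cleared in the same cycle in which the
next one is bumped, and that ``test-block-bits-3`` is never cleared except
through ``reset-counters``. The function names and the directory argument are
hypothetical.

.. code-block:: sh

   # Hypothetical model of the cascading test counters, assuming each
   # counter is reset exactly when the next one is bumped.
   #   $1 = number of POLL cycles since the last reset
   expected_test_bits() {
       n=$1
       echo "test-block-bits-0 $((n % 10))"
       echo "test-block-bits-1 $((n / 10 % 10))"
       echo "test-block-bits-2 $((n / 100 % 10))"
       echo "test-block-bits-3 $((n / 1000))"
   }

   # Writing any value to the control resets the counters.
   #   $1 = directory holding the reset-counters control (assumed path)
   reset_test_counters() {
       echo 1 > "$1/reset-counters"
   }

Under these assumptions, ``expected_test_bits 1234`` prints 4, 3, 2 and 1 for
``test-block-bits-0`` through ``test-block-bits-3``.
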
The ``test_device_edac`` driver is intended as an example that others can use
as a starting point when writing drivers for their own hardware.

The ``test_device_edac`` sample driver is located at the
http://bluesmoke.sourceforge.net project site for EDAC.


Usage of EDAC APIs on Nehalem and newer Intel CPUs
--------------------------------------------------

On older Intel architectures, the memory controller was part of the North
Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Skylake and
newer Intel architectures integrated an enhanced version of the memory
controller (MC) inside the CPUs.

This section covers what is different about the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``sbx_edac``.

.. note::

   The Xeon E7 processor families use a separate chip for the memory
   controller, called Intel Scalable Memory Buffer. This section does not
   apply to such families.

1) There is one Memory Controller per QuickPath Interconnect
   (QPI). In the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API treats the csrow as the smallest unit, the driver
   sequentially maps each channel/DIMM pair to a different csrow.

   For example, suppose the following layout::

	Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
	  dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
	Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
	Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400

   The driver will map it as::

	csrow0: channel 0, dimm0
	csrow1: channel 0, dimm1
	csrow2: channel 1, dimm0
	csrow3: channel 2, dimm0

   That is, the driver exports one DIMM per csrow.

   Each QPI is exported as a different memory controller.

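   The sequential assignment described above can be sketched in a few lines
   of shell. This is only an illustration of the mapping rule, not the
   driver's code; the helper name ``map_csrows`` and its input format (one
   ``channel dimm`` pair per line, in channel order) are assumptions.

   .. code-block:: sh

      # Hypothetical helper: read "channel dimm" pairs from standard input
      # and assign csrow numbers to them sequentially.
      map_csrows() {
          csrow=0
          while read -r channel dimm; do
              echo "csrow${csrow}: channel ${channel}, dimm${dimm}"
              csrow=$((csrow + 1))
          done
      }

   Feeding it the populated slots of the layout above, for instance with
   ``printf '0 0\n0 1\n1 0\n2 0\n' | map_csrows``, prints the same four
   csrow lines that are listed above.
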
2) The MC has the ability to inject errors in order to test drivers. The
   drivers implement this functionality via some error injection nodes.

   For injecting a memory error, there are some sysfs nodes under
   ``/sys/devices/system/edac/mc/mc?/``:

   - ``inject_addrmatch/*``:
      Controls the error injection mask register. It is possible to specify
      several characteristics of the address to match an error code::

         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.

      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to "any".

      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column::

		echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
		echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

      To return to the default behaviour of matching any value, you can do::

		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

   - ``inject_eccmask``:
      specifies which bits will be in error.

   - ``inject_section``:
      specifies which ECC cache section will get the error::

		3 for both
		2 for the highest
		1 for the lowest

   - ``inject_type``:
      specifies the type of error, as a combination of the following bits::

		bit 0 - repeat
		bit 1 - ecc
		bit 2 - parity

   - ``inject_enable``:
      starts the error generation when a value other than 0 is written.

   All injection variables can be read. Root permission is needed to write
   them.

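   Two small shell helpers may make these nodes easier to drive from a
   script. Both are sketches with hypothetical names: the first turns flag
   names into an ``inject_type`` value using the bit positions listed above,
   and the second writes "any" to whatever attributes exist under
   ``inject_addrmatch/``, so it does not depend on the exact attribute names.

   .. code-block:: sh

      # Hypothetical helper: combine flag names into an inject_type value.
      inject_type_value() {
          value=0
          for flag in "$@"; do
              case "$flag" in
              repeat) value=$((value | 1)) ;;   # bit 0
              ecc)    value=$((value | 2)) ;;   # bit 1
              parity) value=$((value | 4)) ;;   # bit 2
              *) echo "unknown flag: $flag" >&2; return 1 ;;
              esac
          done
          echo "$value"
      }

      # Hypothetical helper: set every inject_addrmatch attribute to "any".
      #   $1 = memory controller directory,
      #        e.g. /sys/devices/system/edac/mc/mc0
      addrmatch_reset() {
          for attr in "$1"/inject_addrmatch/*; do
              [ -e "$attr" ] || continue
              echo any > "$attr"
          done
      }

   For instance, ``inject_type_value ecc`` prints 2 and
   ``inject_type_value repeat ecc`` prints 3.
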
   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2::

	echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
	echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
	echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
	echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
	echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
	dd if=/dev/mem of=/dev/null skip=16k bs=4k count=1 >/dev/null 2>&1

   For socket 1, replace "mc0" with "mc1" in the above commands.

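   Since only the ``mcN`` component changes between sockets, the sequence
   can be wrapped in a function that takes the socket number. The sketch
   below is one way to do it; the function name ``inject_ecc_error`` and the
   ``EDAC_MC_ROOT`` override are assumptions made for this example, and the
   values written are the ones from the commands above.

   .. code-block:: sh

      # Hypothetical wrapper around the injection sequence shown above.
      #   $1 = socket number (selects mcN), $2 = channel to match
      # EDAC_MC_ROOT may be set to another directory for dry runs; the
      # default is the sysfs path used throughout this section.
      inject_ecc_error() {
          mc="${EDAC_MC_ROOT:-/sys/devices/system/edac/mc}/mc$1"
          echo "$2" > "$mc/inject_addrmatch/channel" &&
          echo 2 > "$mc/inject_type" &&
          echo 64 > "$mc/inject_eccmask" &&
          echo 3 > "$mc/inject_section" &&
          echo 1 > "$mc/inject_enable"
      }

   ``inject_ecc_error 1 2`` would arm the injection for channel 2 of
   socket 1. A memory access that matches is still needed afterwards, as
   with the ``dd`` command above.
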
   The generated error message will look like::

	EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

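   When checking the result of an injection from a script, the interesting
   fields can be pulled out of such a message with ``sed``. The sketch below
   assumes the message format shown above and nothing more; other drivers or
   kernel versions may print different fields. The helper name
   ``parse_edac_ue`` is hypothetical.

   .. code-block:: sh

      # Hypothetical helper: extract socket, DIMM, channel and syndrome from
      # EDAC messages in the format shown above (read from standard input).
      # Lines that do not match the format produce no output.
      parse_edac_ue() {
          sed -n 's/.*socket=\([0-9]*\), Dimm=\([0-9]*\), Channel=\([0-9]*\), syndrome=\(0x[0-9a-fA-F]*\).*/socket=\1 dimm=\2 channel=\3 syndrome=\4/p'
      }

   For the sample message this prints
   ``socket=0 dimm=0 channel=2 syndrome=0x00000040``. The syndrome 0x40 is
   64 in decimal, which is consistent with the value written to
   ``inject_eccmask`` in the example.
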
3) Corrected Error memory register counters

   These newer MCs have some registers to count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMMs (UDIMMs). As
   the chipset offers some counters that also work with UDIMMs (but with
   coarser granularity than the default ones), the driver exposes those
   registers for UDIMM memories.

   They can be read by looking at the contents of ``all_channel_counts/``::

	$ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
	0
	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
	0
	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
	0

   What happens here is that errors on different csrows, but at the same
   dimm number, will increment the same counter.
   So, in this memory mapping::

	csrow0: channel 0, dimm0
	csrow1: channel 0, dimm1
	csrow2: channel 1, dimm0
	csrow3: channel 2, dimm0

   The hardware will increment udimm0 for an error at the first DIMM of any
   channel, that is, at csrow0, csrow2 or csrow3.

   The hardware will increment udimm1 for an error at the second DIMM of any
   channel, which in this mapping is only csrow1.

   The hardware will increment udimm2 for an error at the third DIMM of any
   channel. No csrow maps to a third DIMM in this example.

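   The aggregation rule can be restated as a one-line transformation of the
   mapping: the counter is selected by the DIMM number alone and the channel
   is ignored. The sketch below only illustrates that rule; the helper name
   ``udimm_counter_for`` is hypothetical and the input format is the csrow
   listing used above.

   .. code-block:: sh

      # Hypothetical helper: for each "csrowN: channel C, dimmD" line on
      # standard input, print the all_channel_counts counter that would
      # collect its errors under the rule described above.
      udimm_counter_for() {
          sed -n 's/^[[:space:]]*\(csrow[0-9]*\): channel [0-9]*, dimm\([0-9]*\)$/\1 -> udimm\2/p'
      }

   Applied to the four csrows above, it reports ``udimm0`` for csrow0,
   csrow2 and csrow3, and ``udimm1`` for csrow1.
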
4) Standard error counters

   The standard error counters are updated when the driver receives an
   mcelog error. With UDIMMs, the counting is done in software, so it is
   possible that some errors could be lost. With RDIMMs, the counters display
   the contents of the hardware registers.

Reference documents used on ``amd64_edac``
------------------------------------------

The ``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. :Title:  BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
	   Opteron Processors
   :AMD publication #: 26094
   :Revision: 3.26
   :Link: http://support.amd.com/TechDocs/26094.PDF

2. :Title:  BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
	   Processors
   :AMD publication #: 32559
   :Revision: 3.00
   :Issue Date: May 2006
   :Link: http://support.amd.com/TechDocs/32559.pdf

3. :Title:  BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
	   Processors
   :AMD publication #: 31116
   :Revision: 3.00
   :Issue Date: September 07, 2007
   :Link: http://support.amd.com/TechDocs/31116.pdf

4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
	  Models 30h-3Fh Processors
   :AMD publication #: 49125
   :Revision: 3.06
   :Issue Date: 2/12/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
	  Models 60h-6Fh Processors
   :AMD publication #: 50742
   :Revision: 3.01
   :Issue Date: 7/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
	  Models 00h-0Fh Processors
   :AMD publication #: 48751
   :Revision: 3.03
   :Issue Date: 2/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

Credits
=======

* Written by Doug Thompson <dougthompson@xmission.com>

  - 7 Dec 2005
  - 17 Jul 2007	Updated

* |copy| Mauro Carvalho Chehab

  - 05 Aug 2009	Nehalem interface
  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section

* EDAC authors/maintainers:

  - Doug Thompson, Dave Jiang, Dave Peterson et al,
  - Mauro Carvalho Chehab
  - Borislav Petkov
  - original author: Thayne Harbaugh