.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
========

About this guide
----------------

This guide describes the basics of the PCI Express (PCIe) Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of Endpoint devices to conform with
the PCIe AER driver.


What is the PCIe AER Driver?
----------------------------

PCIe error signaling can occur on the PCIe link itself
or on behalf of transactions initiated on the link. PCIe
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCIe components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCIe Advanced Error Reporting
extended capability structure and provides more robust error reporting.

The PCIe AER driver provides the infrastructure to support the PCIe
Advanced Error Reporting capability. It provides three basic
functions:

  - Gathers comprehensive error information when errors occur.
  - Reports errors to users.
  - Performs error recovery actions.

The AER driver only attaches to Root Ports and Root Complex Event
Collectors (RCECs) that support the PCIe AER capability.


User Guide
==========

Include the PCIe AER Root Driver into the Linux Kernel
------------------------------------------------------

The PCIe AER driver is a Root Port service driver attached
via the PCIe Port Bus driver. To use it, the driver
must be compiled into the kernel. It is enabled with CONFIG_PCIEAER, which
depends on CONFIG_PCIEPORTBUS.

Load PCIe AER Root Driver
-------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER would result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI Firmware
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. A correctable error is output as an info message;
any other error is printed as an error. Users can therefore choose a
log level that filters out correctable error messages.

Below is an example::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0:   device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0:    [20] Unsupported Request    (First)
  0000:50:00.0:   TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Please refer to the PCIe specs for other
fields.
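
As a reading aid for the example above, the user-space sketch below splits
a Requester ID into bus, device and function numbers and finds the bit set
in the error status word. It assumes the conventional (non-ARI) layout of
8 bus bits, 5 device bits and 3 function bits. The type and function names
are made up for this illustration and are not part of any kernel interface.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: a Requester ID split into its fields. */
struct requester_id {
	unsigned int bus;	/* bits 15:8 */
	unsigned int dev;	/* bits 7:3, assuming no ARI */
	unsigned int fn;	/* bits 2:0, assuming no ARI */
};

struct requester_id decode_requester_id(uint16_t id)
{
	struct requester_id r = {
		.bus = (id >> 8) & 0xff,
		.dev = (id >> 3) & 0x1f,
		.fn  = id & 0x07,
	};

	return r;
}

/* Return the lowest set bit of a status word, or -1 if none is set. */
int lowest_status_bit(uint32_t status)
{
	for (int bit = 0; bit < 32; bit++)
		if (status & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/* Print both decodings for one logged error. */
void print_decoded(uint16_t id, uint32_t status)
{
	struct requester_id r = decode_requester_id(id);

	printf("requester %02x:%02x.%u, status bit %d\n",
	       r.bus, r.dev, r.fn, lowest_status_bit(status));
}
```

With the values from the example, print_decoded(0x0500, 0x00100000) prints
"requester 05:00.0, status bit 20", which matches the "[20] Unsupported
Request" line in the log.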

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes, which are documented in
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.
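
The counters can be read like any other sysfs file. The sketch below is a
minimal user-space reader, assuming the attributes live under
/sys/bus/pci/devices/<device>/ and carry names such as aer_dev_correctable,
aer_dev_nonfatal and aer_dev_fatal; check the ABI document named above for
the authoritative list. The two function names are invented for this
example.

```c
#include <stdio.h>

/*
 * Build the sysfs path of one AER counter attribute.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
int aer_stats_path(char *buf, size_t len, const char *bdf, const char *attr)
{
	int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/%s", bdf, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Copy one attribute to stdout. Returns 0 on success, -1 on failure. */
int aer_stats_dump(const char *bdf, const char *attr)
{
	char path[256], line[128];
	FILE *f;

	if (aer_stats_path(path, sizeof(path), bdf, attr) != 0)
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;	/* device absent, or attribute not exposed */

	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);

	fclose(f);
	return 0;
}
```

A call such as aer_stats_dump("0000:50:00.0", "aer_dev_correctable") would
print the correctable error counters of that device if the attribute exists
on the running system.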

Developer Guide
===============

To enable error recovery, a software driver must provide callbacks.

To support AER better, developers need to understand how AER works.

PCIe errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCIe protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware.

Unlike correctable errors, uncorrectable
errors impact functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCIe link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCIe link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When PCIe error reporting is enabled, a device will automatically send an
error message to the Root Port above it when it captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its AER
Capability structure. Logging the error includes storing
the error reporting agent's Requester ID in the Error Source
Identification Registers and setting the error bits of the Root Error
Status Register accordingly. If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt when an
error is detected.
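
To make the logging step more concrete, the user-space sketch below decodes
a Root Error Status value together with an Error Source Identification
value. The bit positions are assumptions taken from the PCIe AER register
layout as commonly documented (ERR_COR Received in bit 0,
ERR_FATAL/NONFATAL Received in bit 2, Fatal Error Messages Received in
bit 6, correctable source ID in the low 16 bits, uncorrectable source ID in
the high 16 bits) and should be checked against the specification. All
ROOT_* names are local to this example, and preferring an uncorrectable
error when both kinds are flagged is a simplification made here.

```c
#include <stdint.h>

/* Assumed Root Error Status bits; names are local to this sketch. */
#define ROOT_COR_RCV	UINT32_C(0x01)	/* ERR_COR received */
#define ROOT_UNCOR_RCV	UINT32_C(0x04)	/* ERR_FATAL/NONFATAL received */
#define ROOT_FATAL_RCV	UINT32_C(0x40)	/* a fatal message was received */

enum root_severity {
	ROOT_SEV_NONE,
	ROOT_SEV_CORRECTABLE,
	ROOT_SEV_NONFATAL,
	ROOT_SEV_FATAL,
};

struct root_event {
	enum root_severity severity;
	uint16_t source_id;	/* Requester ID of the reporting agent */
};

/*
 * Decode one logged event. An uncorrectable error takes priority over
 * a correctable one when both are flagged.
 */
struct root_event decode_root_event(uint32_t root_status, uint32_t err_src)
{
	struct root_event ev = { ROOT_SEV_NONE, 0 };

	if (root_status & ROOT_UNCOR_RCV) {
		ev.severity = (root_status & ROOT_FATAL_RCV) ?
			      ROOT_SEV_FATAL : ROOT_SEV_NONFATAL;
		ev.source_id = (uint16_t)(err_src >> 16);
	} else if (root_status & ROOT_COR_RCV) {
		ev.severity = ROOT_SEV_CORRECTABLE;
		ev.source_id = (uint16_t)(err_src & 0xffff);
	}
	return ev;
}
```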

Note that the errors as described above are related to the PCIe
hierarchy and links. These errors do not include any device-specific
errors because device-specific errors will still get sent directly to
the device driver.

Provide callbacks
-----------------

callback reset_link to reset PCIe link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCIe physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different Upstream Ports might
have different specifications to reset the PCIe link, so
Upstream Port drivers may provide their own reset_link functions.

The section "Non-correctable (non-fatal and fatal) errors" below provides
more detailed info on when to call reset_link.

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCIe AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with a hierarchy in question
when performing error recovery actions.

The struct pci_driver data structure has a pointer, err_handler, that
points to a struct pci_error_handlers, which consists of several callback
function pointers. The AER driver follows the rules defined in
pci-error-recovery.rst except for the PCIe-specific parts (e.g.
reset_link). Please refer to pci-error-recovery.rst for detailed
definitions of the callbacks.

The sections below specify when to call the error callback functions.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors have no impact on the functionality of
the interface. The PCIe protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, performing a link reset
upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) on all drivers associated with the hierarchy in
question. For example::

  Endpoint <==> Downstream Port B <==> Upstream Port A <==> Root Port

If Upstream Port A captures an AER error, the hierarchy consists of
Downstream Port B and Endpoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover. If it returns PCI_ERS_RESULT_CAN_RECOVER, the AER
driver calls mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Then, performing a link reset upstream is
necessary. As different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide the
function to reset the link via the callback parameter of the
pcie_do_recovery() function. If reset_link is not NULL, the recovery
function will use it to reset the link. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns PCI_ERS_RESULT_RECOVERED,
the error handling goes to mmio_enabled.
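
The ordering described in this section can be modelled in a few lines of
user-space C. This is only a sketch of the decision described above, not
the kernel's pcie_do_recovery() implementation: it handles a single driver
instead of walking a hierarchy, every flow_* and FLOW_* name is a local
stand-in invented for the example, and anything the text does not spell
out is folded into FLOW_NEXT_OTHER. The two sample callbacks exist only so
that the model can be exercised.

```c
#include <stddef.h>

/* Local stand-ins for the kernel's result and channel-state values. */
enum flow_result {
	FLOW_CAN_RECOVER,
	FLOW_NEED_RESET,
	FLOW_DISCONNECT,
	FLOW_RECOVERED,
};
enum flow_channel { FLOW_IO_NORMAL, FLOW_IO_FROZEN };
enum flow_next { FLOW_NEXT_MMIO_ENABLED, FLOW_NEXT_OTHER };

typedef enum flow_result (*flow_error_detected_fn)(enum flow_channel state);
typedef enum flow_result (*flow_reset_link_fn)(void);

/* Decide whether error handling proceeds to mmio_enabled. */
enum flow_next flow_recover(int fatal, flow_error_detected_fn error_detected,
			    flow_reset_link_fn reset_link)
{
	enum flow_result detected;
	enum flow_result reset = FLOW_RECOVERED; /* no reset if non-fatal */

	/* Non-fatal errors report a normal channel, fatal ones a frozen one. */
	detected = error_detected(fatal ? FLOW_IO_FROZEN : FLOW_IO_NORMAL);

	if (fatal) {
		if (reset_link == NULL)
			return FLOW_NEXT_OTHER;	/* the link cannot be reset */
		reset = reset_link();
	}

	if (detected == FLOW_CAN_RECOVER && reset == FLOW_RECOVERED)
		return FLOW_NEXT_MMIO_ENABLED;
	return FLOW_NEXT_OTHER;
}

/* Sample callbacks, used only to exercise the model. */
enum flow_result sample_error_detected(enum flow_channel state)
{
	(void)state;
	return FLOW_CAN_RECOVER;
}

enum flow_result sample_reset_link(void)
{
	return FLOW_RECOVERED;
}
```

With the sample callbacks, both flow_recover(0, sample_error_detected, NULL)
and flow_recover(1, sample_error_detected, sample_reset_link) return
FLOW_NEXT_MMIO_ENABLED, while a fatal error without a reset_link function
does not.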

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCIe device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices attached to the driver won't be recovered. If the
  error is fatal, the kernel will print warning messages. Please refer
  to the Developer Guide above for more information.

Q:
  What happens if an upstream port service driver does not provide
  the reset_link callback?

A:
  Fatal error recovery will fail if the errors are reported by the
  upstream ports that the service driver is attached to.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software-based error
injection can be used to fake various kinds of PCIe errors.

First, enable PCIe AER software error injection in the kernel
configuration; that is, the following item should be in your .config::

  CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device file
named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which can be obtained
from:

    https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the documentation in
its source code.