.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
========

About this guide
----------------

This guide describes the basics of the PCI Express (PCIe) Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of Endpoint devices to conform with
the PCIe AER driver.


What is the PCIe AER Driver?
----------------------------

PCIe error signaling can occur on the PCIe link itself
or on behalf of transactions initiated on the link. PCIe
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCIe components providing a minimum defined
set of error reporting requirements.
The Advanced Error Reporting
capability is implemented with a PCIe Advanced Error Reporting
extended capability structure providing more robust error reporting.

The PCIe AER driver provides the infrastructure to support the PCIe Advanced
Error Reporting capability. The PCIe AER driver provides three basic
functions:

  - Gathers comprehensive error information when errors occur.
  - Reports errors to the user.
  - Performs error recovery actions.

The AER driver only attaches to Root Ports and Root Complex Event
Collectors (RCECs) that support the PCIe AER capability.


User Guide
==========

Include the PCIe AER Root Driver into the Linux Kernel
------------------------------------------------------

The PCIe AER driver is a Root Port service driver attached
via the PCIe Port Bus driver. To use it, the driver
must be compiled into the kernel. It is enabled with CONFIG_PCIEAER, which
depends on CONFIG_PCIEPORTBUS.

Load PCIe AER Root Driver
-------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER would result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method.
See the PCI Firmware
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. If it's a correctable error, it is output as an info message.
Otherwise, it is printed as an error. Users can therefore choose a
log level that filters out correctable error messages.

An example is shown below::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0: [20] Unsupported Request (First)
  0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Please refer to the PCIe specs for the
other fields.

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes, which are documented at
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats

Developer Guide
===============

To enable error recovery, a software driver must provide callbacks.

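As a rough illustration of what "providing callbacks" means, the sketch
below builds a callback table in plain userspace C. The type and field names
imitate the kernel's pci_driver, pci_error_handlers and pci_ers_result_t,
but they are simplified stand-ins defined here so the example compiles
outside the kernel. The demo_* functions, the contents of struct pci_dev and
the enum values are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel types; the real ones live in <linux/pci.h>. */
typedef enum {
	pci_channel_io_normal,
	pci_channel_io_frozen,
	pci_channel_io_perm_failure,
} pci_channel_state_t;

typedef enum {
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
} pci_ers_result_t;

struct pci_dev {
	int io_enabled;		/* hypothetical "driver may touch the device" flag */
};

struct pci_error_handlers {
	pci_ers_result_t (*error_detected)(struct pci_dev *dev,
					   pci_channel_state_t state);
	pci_ers_result_t (*mmio_enabled)(struct pci_dev *dev);
	pci_ers_result_t (*slot_reset)(struct pci_dev *dev);
	void (*resume)(struct pci_dev *dev);
};

struct pci_driver {
	const char *name;
	const struct pci_error_handlers *err_handler;
};

/* Called first: stop I/O and tell the caller what kind of recovery is needed. */
static pci_ers_result_t demo_error_detected(struct pci_dev *dev,
					    pci_channel_state_t state)
{
	assert(dev != NULL);
	dev->io_enabled = 0;
	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;
	if (state == pci_channel_io_frozen)
		return PCI_ERS_RESULT_NEED_RESET;
	return PCI_ERS_RESULT_CAN_RECOVER;
}

static pci_ers_result_t demo_mmio_enabled(struct pci_dev *dev)
{
	(void)dev;
	return PCI_ERS_RESULT_RECOVERED;
}

static pci_ers_result_t demo_slot_reset(struct pci_dev *dev)
{
	(void)dev;
	return PCI_ERS_RESULT_RECOVERED;
}

/* Called last: normal operation may continue. */
static void demo_resume(struct pci_dev *dev)
{
	assert(dev != NULL);
	dev->io_enabled = 1;
}

static const struct pci_error_handlers demo_err_handler = {
	.error_detected	= demo_error_detected,
	.mmio_enabled	= demo_mmio_enabled,
	.slot_reset	= demo_slot_reset,
	.resume		= demo_resume,
};

static const struct pci_driver demo_driver = {
	.name		= "demo",
	.err_handler	= &demo_err_handler,
};
```

A real driver would set pci_driver.err_handler the same way, using the
kernel's own definitions instead of the stand-ins above.
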

To support AER better, developers need to understand how AER works.

PCIe errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCIe protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware.

Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCIe link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCIe link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When PCIe error reporting is enabled, a device will automatically send an
error message to the Root Port above it when it captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its AER
Capability structure.
Logging the error includes storing
the error reporting agent's Requester ID in the Error Source
Identification Register and setting the error bits of the Root Error
Status Register accordingly. If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt when an
error is detected.

Note that the errors described above are related to the PCIe
hierarchy and links. They do not include any device specific
errors because device specific errors will still get sent directly to
the device driver.

Provide callbacks
-----------------

callback reset_link to reset PCIe link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCIe physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different Upstream Ports might
have different specifications to reset the PCIe link, so
Upstream Port drivers may provide their own reset_link functions.

The section "Non-correctable (non-fatal and fatal) errors" below
describes in more detail when reset_link is called.


PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCIe AER Root driver uses error callbacks to coordinate
with the downstream device drivers associated with the hierarchy in question
when performing error recovery actions.

struct pci_driver has a pointer, err_handler, that points to a
pci_error_handlers structure, which consists of several callback function
pointers. The AER driver follows the rules defined in
pci-error-recovery.rst except for the PCIe-specific parts (e.g.
reset_link). Please refer to pci-error-recovery.rst for detailed
definitions of the callbacks.

The sections below specify when the error callback functions are called.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors have no impact on the functionality of
the interface. The PCIe protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, performing link reset
at upstream is not required.
The AER driver calls error_detected(dev,
pci_channel_io_normal) for all drivers associated with the hierarchy in
question. For example::

  Endpoint <==> Downstream Port B <==> Upstream Port A <==> Root Port

If Upstream Port A captures an AER error, the hierarchy consists of
Downstream Port B and Endpoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover. If recovery is possible, the AER driver calls
mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. A link reset at upstream is then
necessary. As different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide the
function to reset the link via the callback parameter of the
pcie_do_recovery() function. If reset_link is not NULL, the recovery
function will use it to reset the link. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns PCI_ERS_RESULT_RECOVERED,
the error handling goes to mmio_enabled.

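The non-fatal and fatal paths described above can be sketched as a small
userspace model. This is not the kernel's pcie_do_recovery(): the function
model_do_recovery(), struct model_dev, the demo_* callbacks and the
"most severe answer wins" merging rule are all assumptions made for
illustration, and the enums are stand-ins for the kernel types of the same
name.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { pci_channel_io_normal, pci_channel_io_frozen } pci_channel_state_t;

/* Ordered from least to most severe so that results can be merged with max(). */
typedef enum {
	PCI_ERS_RESULT_RECOVERED,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
} pci_ers_result_t;

struct model_dev {
	const char *name;
	pci_ers_result_t (*error_detected)(struct model_dev *dev,
					   pci_channel_state_t state);
	pci_ers_result_t (*mmio_enabled)(struct model_dev *dev);
	int notified;			/* error_detected call count */
	int mmio_enabled_calls;		/* mmio_enabled call count */
};

typedef pci_ers_result_t (*reset_link_fn)(void);

static pci_ers_result_t worse_of(pci_ers_result_t a, pci_ers_result_t b)
{
	return a > b ? a : b;
}

static pci_ers_result_t model_do_recovery(struct model_dev **devs, size_t n,
					  pci_channel_state_t state,
					  reset_link_fn reset_link)
{
	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
	size_t i;

	assert(devs != NULL);

	/* Step 1: notify every driver in the affected hierarchy. */
	for (i = 0; i < n; i++)
		status = worse_of(status, devs[i]->error_detected(devs[i], state));

	/* Step 2: only a fatal error (frozen channel) requires a link reset. */
	if (state == pci_channel_io_frozen) {
		if (reset_link == NULL ||
		    reset_link() != PCI_ERS_RESULT_RECOVERED)
			return PCI_ERS_RESULT_DISCONNECT;
	}

	/* Step 3: if all drivers can recover, continue with mmio_enabled. */
	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
		status = PCI_ERS_RESULT_RECOVERED;
		for (i = 0; i < n; i++)
			status = worse_of(status, devs[i]->mmio_enabled(devs[i]));
	}
	return status;
}

/* Hypothetical driver callbacks and link reset used to exercise the model. */
static pci_ers_result_t demo_error_detected(struct model_dev *dev,
					    pci_channel_state_t state)
{
	(void)state;
	dev->notified++;
	return PCI_ERS_RESULT_CAN_RECOVER;
}

static pci_ers_result_t demo_mmio_enabled(struct model_dev *dev)
{
	dev->mmio_enabled_calls++;
	return PCI_ERS_RESULT_RECOVERED;
}

static int demo_resets;

static pci_ers_result_t demo_reset_link(void)
{
	demo_resets++;
	return PCI_ERS_RESULT_RECOVERED;
}
```

In this model a non-fatal error never touches the link, while a fatal error
without a usable reset_link callback ends in PCI_ERS_RESULT_DISCONNECT.
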

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCIe device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices attached to the driver won't be recovered. If the
  error is fatal, the kernel will print out warning messages. Please refer
  to the Developer Guide section for more information.

Q:
  What happens if an upstream port service driver does not provide
  callback reset_link?

A:
  Fatal error recovery will fail if the errors are reported by the
  upstream ports to which the service driver is attached.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software based error
injection can be used to fake various kinds of PCIe errors.

First you should enable PCIe AER software error injection in the kernel
configuration, that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device file
named /dev/aer_inject should be created.

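One way to confirm that the device file is present before running any
injection tool is a small userspace check. The helper name is_char_device()
is hypothetical, and the sketch assumes that /dev/aer_inject shows up as a
character device node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

/* Return true if 'path' exists and is a character device node. */
static bool is_char_device(const char *path)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return false;
	return S_ISCHR(st.st_mode);
}
```

Calling is_char_device("/dev/aer_inject") should return true once the
injection support is built in or the module is loaded.
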

Then, you need a user space tool named aer-inject, which can be obtained
from:

    https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the documentation in
its source code.