What:		/sys/devices/system/machinecheck/machinecheckX/
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		(X = CPU number)

		Machine checks report internal hardware error conditions
		detected by the CPU. Uncorrected errors typically cause a
		machine check (often with panic); corrected ones cause a
		machine check log entry.

		For more details about the x86 machine check architecture
		see the Intel and AMD architecture manuals from their
		developer websites.

		For more details about the architecture
		see http://one.firstfloor.org/~andi/mce.pdf

		Each CPU has its own directory.

What:		/sys/devices/system/machinecheck/machinecheckX/bank<Y>
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		(Y = bank number)

		64-bit hexadecimal bitmask enabling/disabling specific
		subevents for bank Y.

		When a bit in the bitmask is zero then the respective
		subevent will not be reported.

		By default all events are enabled.

		Note that the BIOS maintains another mask to disable specific
		events per bank. This is not visible here.

What:		/sys/devices/system/machinecheck/machinecheckX/check_interval
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		The entries appear for each CPU, but they are truly shared
		between all CPUs.

		How often to poll for corrected machine check errors, in
		seconds (note that the output is hexadecimal). Default: 5
		minutes. When the poller finds MCEs it triggers an exponential
		speedup (poll more often) on the polling interval. When the
		poller stops finding MCEs, it triggers an exponential backoff
		(poll less often) on the polling interval. The check_interval
		variable is both the initial and maximum polling interval.
		0 means no polling for corrected machine check errors
		(but some corrected errors might still be reported
		in other ways).

What:		/sys/devices/system/machinecheck/machinecheckX/tolerant
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		The entries appear for each CPU, but they are truly shared
		between all CPUs.

		Tolerance level. When a machine check exception occurs for an
		uncorrected machine check the kernel can take different
		actions.

		Since machine check exceptions can happen at any time, it is
		sometimes risky for the kernel to kill a process because it
		defies normal kernel locking rules. The tolerance level
		configures how hard the kernel tries to recover even at some
		risk of deadlock. Higher tolerant values trade potentially
		better uptime against the risk of a crash or even corruption
		(for tolerant >= 3).

		==  ===========================================================
		0   always panic on uncorrected errors, log corrected errors
		1   panic or SIGBUS on uncorrected errors, log corrected errors
		2   SIGBUS or log uncorrected errors, log corrected errors
		3   never panic or SIGBUS, log all errors (for testing only)
		==  ===========================================================

		Default: 1

		Note that this only makes a difference if the CPU allows
		recovery from a machine check exception. Current x86 CPUs
		generally do not.

What:		/sys/devices/system/machinecheck/machinecheckX/trigger
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		The entries appear for each CPU, but they are truly shared
		between all CPUs.

		Program to run when a machine check event is detected.
		This is an alternative to running mcelog regularly from cron
		and allows events to be detected faster.
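
The shared attributes described above (check_interval, tolerant, trigger) can be read from any CPU's directory. The following is a minimal sketch, not part of the kernel interface itself: the helper names are hypothetical, machinecheck0 is chosen arbitrarily, and it assumes that tolerant reads back as a plain decimal integer. Files that are absent on a given kernel are reported as None.

```python
from pathlib import Path

# Directory for CPU 0; the shared attributes appear under every CPU.
# Picking CPU 0 is an arbitrary choice for this sketch.
MCE_DIR = Path("/sys/devices/system/machinecheck/machinecheck0")

# Meaning of each tolerant level, copied from the table above.
TOLERANT_LEVELS = {
    0: "always panic on uncorrected errors, log corrected errors",
    1: "panic or SIGBUS on uncorrected errors, log corrected errors",
    2: "SIGBUS or log uncorrected errors, log corrected errors",
    3: "never panic or SIGBUS, log all errors (for testing only)",
}


def parse_check_interval(text):
    """Convert the hexadecimal check_interval string to seconds."""
    return int(text.strip(), 16)


def describe_tolerant(text):
    """Return the table entry for a tolerant value.

    Assumption: the value is a plain decimal integer.
    """
    level = int(text.strip())
    return TOLERANT_LEVELS.get(level, "unknown tolerance level")


def read_shared_settings(mce_dir=MCE_DIR):
    """Read the raw shared attributes; absent files map to None."""
    settings = {}
    for name in ("check_interval", "tolerant", "trigger"):
        path = mce_dir / name
        settings[name] = path.read_text().strip() if path.is_file() else None
    return settings


if __name__ == "__main__":
    for name, value in read_shared_settings().items():
        print(name, "=", value)
```

With the documented default of 5 minutes, check_interval would be expected to parse to 300 seconds.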

What:		/sys/devices/system/machinecheck/machinecheckX/monarch_timeout
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		How long to wait for the other CPUs to machine check too on an
		exception. 0 to disable waiting for other CPUs.

		Unit: us
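
The per-CPU directory layout and the bank<Y> bitmask semantics in this file can be illustrated with a short sketch. The function names below are hypothetical, and the microsecond conversion assumes that monarch_timeout reads back as a decimal integer, which this file does not state.

```python
from pathlib import Path

MCE_ROOT = Path("/sys/devices/system/machinecheck")


def bank_path(cpu, bank, root=MCE_ROOT):
    """Build the path of the bank<Y> file for CPU number X."""
    return root / f"machinecheck{cpu}" / f"bank{bank}"


def parse_bank_mask(text):
    """Parse the 64-bit hexadecimal bitmask read from a bank<Y> file."""
    return int(text.strip(), 16)


def disabled_subevents(mask):
    """List the bit positions that are zero, i.e. subevents not reported."""
    return [bit for bit in range(64) if not (mask >> bit) & 1]


def monarch_timeout_seconds(text):
    """Convert monarch_timeout from microseconds to seconds.

    Assumption: the attribute is printed as a decimal integer.
    """
    return int(text.strip()) / 1_000_000
```

A mask with every bit set corresponds to the default state in which all events are enabled, so disabled_subevents() returns an empty list for it.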