What:		/sys/devices/system/machinecheck/machinecheckX/
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		(X = CPU number)

		Machine checks report internal hardware error conditions
		detected by the CPU. Uncorrected errors typically cause a
		machine check (often with panic), corrected ones cause a
		machine check log entry.

		For more details about the x86 machine check architecture
		see the Intel and AMD architecture manuals from their
		developer websites.

		For more details about the architecture
		see http://one.firstfloor.org/~andi/mce.pdf

		Each CPU has its own directory.

What:		/sys/devices/system/machinecheck/machinecheckX/bank<Y>
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		(Y = bank number)

		64-bit hex bitmask enabling/disabling specific subevents for
		bank Y.

		When a bit in the bitmask is zero, the respective
		subevent will not be reported.

		By default all events are enabled.
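
		As a rough illustration of how the mask is interpreted, the
		Python sketch below decodes a bank value into enabled and
		disabled subevent bit positions. The helper name
		decode_bank_mask and the sample value are hypothetical; only
		the 64-bit hex format and the rule that a zero bit suppresses
		reporting come from the description above.

```python
def decode_bank_mask(text):
    """Split a 64-bit hex bank mask into enabled and disabled bit positions."""
    # int(..., 16) accepts the value with or without a leading "0x".
    mask = int(text.strip(), 16) & 0xFFFFFFFFFFFFFFFF
    enabled = [bit for bit in range(64) if (mask >> bit) & 1]
    disabled = [bit for bit in range(64) if not (mask >> bit) & 1]
    return enabled, disabled


# All bits set: every subevent is reported (the documented default).
enabled, disabled = decode_bank_mask("ffffffffffffffff")
print(len(enabled), len(disabled))
```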

		Note that the BIOS maintains another mask to disable specific
		events per bank. This is not visible here.

What:		/sys/devices/system/machinecheck/machinecheckX/check_interval
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		The entries appear for each CPU, but they are truly shared
		between all CPUs.

		How often to poll for corrected machine check errors, in
		seconds (note that the output is hexadecimal). Default: 5 minutes.
		When the poller finds MCEs it triggers an exponential speedup
		(poll more often) on the polling interval. When the poller
		stops finding MCEs, it triggers an exponential backoff
		(poll less often) on the polling interval. The check_interval
		variable is both the initial and maximum polling interval.
		0 means no polling for corrected machine check errors
		(but some corrected errors might still be reported
		in other ways).

What:		/sys/devices/system/machinecheck/machinecheckX/tolerant
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		The entries appear for each CPU, but they are truly shared
		between all CPUs.

		Tolerance level. When a machine check exception occurs for an
		uncorrected machine check, the kernel can take different
		actions.

		Since machine check exceptions can happen at any time, it is
		sometimes risky for the kernel to kill a process because it
		defies normal kernel locking rules. The tolerance level
		configures how hard the kernel tries to recover even at some
		risk of deadlock. Higher tolerant values trade potentially
		better uptime for the risk of a crash or even corruption
		(for tolerant >= 3).

		==  ===========================================================
		 0  always panic on uncorrected errors, log corrected errors
		 1  panic or SIGBUS on uncorrected errors, log corrected errors
		 2  SIGBUS or log uncorrected errors, log corrected errors
		 3  never panic or SIGBUS, log all errors (for testing only)
		==  ===========================================================

		Default: 1

		Note this only makes a difference if the CPU allows recovery
		from a machine check exception. Current x86 CPUs generally
		do not.
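
		The tolerance table above can be restated as a small lookup,
		which may be handy in monitoring scripts. This is only a
		sketch: TOLERANT_ACTIONS and describe_tolerant are
		hypothetical names, reading the file contents as a decimal
		integer is an assumption, and treating values above 3 like
		level 3 is an assumption based on the "tolerant >= 3" remark.

```python
# Policy text copied from the tolerance table in the description.
TOLERANT_ACTIONS = {
    0: "always panic on uncorrected errors, log corrected errors",
    1: "panic or SIGBUS on uncorrected errors, log corrected errors",
    2: "SIGBUS or log uncorrected errors, log corrected errors",
    3: "never panic or SIGBUS, log all errors (for testing only)",
}


def describe_tolerant(text):
    """Map the contents of the tolerant file to its documented policy."""
    level = int(text.strip())  # assumption: the value is printed in decimal
    if level < 0:
        raise ValueError("tolerant level must not be negative")
    # Assumption: levels above 3 behave like level 3.
    return TOLERANT_ACTIONS[min(level, 3)]


print(describe_tolerant("1\n"))
```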

What:		/sys/devices/system/machinecheck/machinecheckX/trigger
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		The entries appear for each CPU, but they are truly shared
		between all CPUs.

		Program to run when a machine check event is detected.
		This is an alternative to running mcelog regularly from cron
		and allows events to be detected faster.

What:		/sys/devices/system/machinecheck/machinecheckX/monarch_timeout
Contact:	Andi Kleen <ak@linux.intel.com>
Date:		Feb, 2007
Description:
		How long to wait for the other CPUs to machine check too on an
		exception. 0 to disable waiting for other CPUs.

		Unit: us

What:		/sys/devices/system/machinecheck/machinecheckX/ignore_ce
Contact:	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Date:		Jun 2009
Description:
		Disables polling and CMCI for corrected errors.
		Corrected events are not cleared and are kept in the bank MSRs.

What:		/sys/devices/system/machinecheck/machinecheckX/dont_log_ce
Contact:	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Date:		Jun 2009
Description:
		Disables logging for corrected errors.
		All reported corrected errors will be cleared silently.

		This option is useful if you never care about corrected
		errors.

What:		/sys/devices/system/machinecheck/machinecheckX/cmci_disabled
Contact:	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Date:		Jun 2009
Description:
		Disables the CMCI feature.
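
		For reference, the check_interval behaviour described earlier
		can be modelled roughly as follows. The sketch parses the
		hexadecimal value and steps through the speedup and backoff
		of the polling interval. The function names, the factor of 2
		and the one-second lower bound are assumptions made for
		illustration; the description only states that the adjustment
		is exponential and that check_interval is both the initial
		and the maximum interval.

```python
def parse_check_interval(text):
    """Convert the hexadecimal check_interval output to seconds."""
    return int(text.strip(), 16)


def next_interval(current, found_mce, maximum, factor=2, minimum=1):
    """Return the next polling interval in seconds.

    factor and minimum are illustrative assumptions, not documented values.
    """
    if maximum == 0:
        return 0  # 0 means no polling for corrected errors
    if found_mce:
        # Speedup: poll more often while errors keep showing up.
        return max(current // factor, minimum)
    # Backoff: poll less often, capped at check_interval.
    return min(current * factor, maximum)


maximum = parse_check_interval("12c\n")  # 0x12c == 300 s, i.e. 5 minutes
interval = maximum
history = []
for found in (True, True, False, False, False):
    interval = next_interval(interval, found, maximum)
    history.append(interval)
print(history)  # [150, 75, 150, 300, 300]
```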