=========================
Hardware Latency Detector
=========================

Introduction
-------------

The tracer hwlat_detector is a special purpose tracer that is used to
detect large system latencies induced by the behavior of certain underlying
hardware or firmware, independent of Linux itself. The code was developed
originally to detect SMIs (System Management Interrupts) on x86 systems,
although nothing in the code is x86 specific. It was
originally written for use by the "RT" patch since the Real Time
kernel is highly latency sensitive.

SMIs are not serviced by the Linux kernel, which means that it does not
even know that they are occurring. SMIs are instead set up by BIOS code
and are serviced by BIOS code, usually for "critical" events such as
management of thermal sensors and fans. Sometimes though, SMIs are used for
other tasks and those tasks can spend an inordinate amount of time in the
handler (sometimes measured in milliseconds). Obviously this is a problem if
you are trying to keep event service latencies down in the microsecond range.

The hardware latency detector works by hogging one of the CPUs for configurable
amounts of time (with interrupts disabled), polling the CPU Time Stamp Counter
for some period, then looking for gaps in the TSC data. Any gap indicates a
time when the polling was interrupted and, since interrupts are disabled,
the only thing that could do that would be an SMI or other hardware hiccup
(or an NMI, but those can be tracked).

Note that the hwlat detector should *NEVER* be used in a production environment.
It is intended to be run manually to determine if the hardware platform has a
problem with long system firmware service routines.

Usage
------

Write the ASCII text "hwlat" into the current_tracer file of the tracing system
(mounted at /sys/kernel/tracing or /sys/kernel/debug/tracing). It is possible to
redefine the threshold in microseconds (us) above which latency spikes will
be taken into account.

Example::

	# echo hwlat > /sys/kernel/tracing/current_tracer
	# echo 100 > /sys/kernel/tracing/tracing_thresh

The /sys/kernel/tracing/hwlat_detector interface contains the following files:

  - width - time period to sample with CPUs held (usecs)
            must be less than the total window size (enforced)
  - window - total period of sampling, width being inside (usecs)

By default the width is set to 500,000 and window to 1,000,000, meaning that
for every 1,000,000 usecs (1s) the hwlat detector will spin for 500,000 usecs
(0.5s). If tracing_thresh contains zero when the hwlat tracer is enabled, it will
change to a default of 10 usecs. If any latency that exceeds the threshold is
observed then the data will be written to the tracing ring buffer.
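
As a minimal sketch of changing the sampling parameters described above, the
following shell function writes a new width and window. The function name
hwlat_configure and the TRACEFS variable are hypothetical and not part of the
kernel; only the file names come from this document. The ordering of the two
writes assumes that the "width less than window" check is applied on every
write against the value currently stored, which this text does not spell out.

```shell
# Hypothetical helper, not part of the kernel tree.
# TRACEFS is an assumed variable naming the tracefs mount point.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Usage: hwlat_configure WIDTH_USECS WINDOW_USECS
hwlat_configure() {
	hc_width="$1"
	hc_window="$2"
	hc_dir="$TRACEFS/hwlat_detector"

	# Reject an invalid pair up front for a clearer error message.
	if [ "$hc_width" -ge "$hc_window" ]; then
		echo "width ($hc_width) must be less than window ($hc_window)" >&2
		return 1
	fi

	hc_cur_window=$(cat "$hc_dir/window") || return 1

	# Assumption: each write is checked against the other file's current
	# value, so pick the order that keeps width < window at every step.
	if [ "$hc_width" -lt "$hc_cur_window" ]; then
		echo "$hc_width" > "$hc_dir/width" &&
			echo "$hc_window" > "$hc_dir/window"
	else
		echo "$hc_window" > "$hc_dir/window" &&
			echo "$hc_width" > "$hc_dir/width"
	fi
}
```

For instance, ``hwlat_configure 100000 1000000`` (run as root) would ask the
detector to spin for 0.1s out of every 1s.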

The minimum sleep time between periods is 1 millisecond. This holds even if
width is less than 1 millisecond apart from window, so that the system is not
totally starved.
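
The 1 millisecond floor can be illustrated with a little arithmetic. The sketch
below assumes the sleep between sampling periods is window minus width,
truncated to whole milliseconds; only the 1 millisecond minimum is stated in
this text, and the function name hwlat_sleep_ms is hypothetical.

```shell
# Hypothetical helper: estimate the sleep (in milliseconds) between two
# sampling periods. Assumes sleep = (window - width), truncated to whole
# milliseconds, with the documented floor of 1 millisecond.
# Usage: hwlat_sleep_ms WIDTH_USECS WINDOW_USECS
hwlat_sleep_ms() {
	hs_width="$1"
	hs_window="$2"
	hs_sleep=$(( (hs_window - hs_width) / 1000 ))
	if [ "$hs_sleep" -lt 1 ]; then
		hs_sleep=1
	fi
	echo "$hs_sleep"
}
```

With the default width and window the estimate is 500 milliseconds; with a
width of 999,500 and a window of 1,000,000 the floor applies and the result
is 1.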

If tracing_thresh was zero when hwlat detector was started, it will be set
back to zero if another tracer is loaded. Note, the last value in
tracing_thresh that hwlat detector had will be saved and this value will
be restored in tracing_thresh if it is still zero when hwlat detector is
started again.

The following tracing directory files are used by the hwlat_detector:

in /sys/kernel/tracing:

 - tracing_thresh	- minimum latency value to be considered (usecs)
 - tracing_max_latency	- maximum hardware latency actually observed (usecs)
 - tracing_cpumask	- the CPUs to move the hwlat thread across
 - hwlat_detector/width	- specified amount of time to spin within window (usecs)
 - hwlat_detector/window	- amount of time between (width) runs (usecs)
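
To inspect these files in one go, a small helper along the following lines may
be handy. It is a sketch only: hwlat_show and TRACEFS are hypothetical names,
and the threshold file is read as tracing_thresh, the name used in the example
commands earlier in this document.

```shell
# Hypothetical helper, not part of the kernel tree.
# TRACEFS is an assumed variable naming the tracefs mount point.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Print the current value of each tracefs file used by the hwlat tracer.
hwlat_show() {
	for hs_file in tracing_thresh tracing_max_latency tracing_cpumask \
		hwlat_detector/width hwlat_detector/window; do
		if [ -r "$TRACEFS/$hs_file" ]; then
			printf '%s: %s\n' "$hs_file" "$(cat "$TRACEFS/$hs_file")"
		else
			# Missing file or insufficient permissions.
			printf '%s: (not readable)\n' "$hs_file"
		fi
	done
}
```

Reading these files normally requires root privileges.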

The hwlat detector's kernel thread will migrate across each CPU specified in
tracing_cpumask between each window. To limit the migration, either modify
tracing_cpumask, or modify the hwlat kernel thread (named [hwlatd]) CPU
affinity directly, and the migration will stop.
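
tracing_cpumask takes a hexadecimal bitmap in which bit N stands for CPU N. As
a sketch of limiting the migration through that file, the hypothetical helper
below (cpus_to_mask is not a kernel-provided tool) builds such a mask from a
list of CPU numbers. It assumes CPU numbers below 32, so that the result fits
in a single mask word and in shell arithmetic.

```shell
# Hypothetical helper: turn a list of CPU numbers into a hex bitmap.
# Assumes every CPU number is below 32.
# Usage: cpus_to_mask CPU [CPU...]
cpus_to_mask() {
	cm_mask=0
	for cm_cpu in "$@"; do
		# Set the bit that corresponds to this CPU.
		cm_mask=$(( cm_mask | (1 << cm_cpu) ))
	done
	printf '%x\n' "$cm_mask"
}

# Example (needs root): let the hwlat thread visit only CPUs 0 and 1.
#   cpus_to_mask 0 1 > /sys/kernel/tracing/tracing_cpumask
```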