=========================
Hardware Latency Detector
=========================

Introduction
-------------

The hwlat_detector tracer is a special-purpose tracer used to
detect large system latencies induced by the behavior of certain underlying
hardware or firmware, independent of Linux itself. The code was originally
developed to detect SMIs (System Management Interrupts) on x86 systems,
but nothing about it is x86-specific. It was originally written for use
with the "RT" patch, since the Real Time kernel is highly latency sensitive.

SMIs are not serviced by the Linux kernel, which means that it does not
even know that they are occurring. SMIs are instead set up by BIOS code
and are serviced by BIOS code, usually for "critical" events such as
management of thermal sensors and fans. Sometimes, though, SMIs are used for
other tasks, and those tasks can spend an inordinate amount of time in the
handler (sometimes measured in milliseconds). Obviously this is a problem if
you are trying to keep event service latencies down in the microsecond range.

The hardware latency detector works by hogging one of the CPUs for configurable
amounts of time (with interrupts disabled), polling the CPU Time Stamp Counter
for some period, then looking for gaps in the TSC data. Any gap indicates a
time when the polling was interrupted, and since interrupts are disabled,
the only thing that could do that would be an SMI or other hardware hiccup
(or an NMI, but those can be tracked).

Note that the hwlat detector should *NEVER* be used in a production environment.
It is intended to be run manually to determine if the hardware platform has a
problem with long system firmware service routines.

Usage
------

Write the ASCII text "hwlat" into the current_tracer file of the tracing system
(mounted at /sys/kernel/tracing). It is possible to
redefine the threshold in microseconds (us) above which latency spikes will
be taken into account.

Example::

	# echo hwlat > /sys/kernel/tracing/current_tracer
	# echo 100 > /sys/kernel/tracing/tracing_thresh
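The same two writes can be wrapped in small shell functions, which is convenient
when the detector is started and stopped repeatedly. This is a minimal sketch,
not part of the tracer: the function names hwlat_start and hwlat_stop and the
TRACEFS variable are hypothetical, and switching back to the "nop" tracer is
just one way to end a run.

```shell
# Assumed override for the tracefs mount point (hypothetical variable name).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# hwlat_start [THRESHOLD_US]
# Select the hwlat tracer, then optionally set the threshold in microseconds.
# The tracer is selected first, in the same order as the example above.
hwlat_start() {
    echo hwlat > "$TRACEFS/current_tracer" || return 1
    if [ -n "$1" ]; then
        echo "$1" > "$TRACEFS/tracing_thresh" || return 1
    fi
}

# hwlat_stop
# End the run by selecting the "nop" tracer.
hwlat_stop() {
    echo nop > "$TRACEFS/current_tracer"
}
```

After hwlat_start 100 has run for a while, any recorded samples can be read
from the trace file in the same directory; hwlat_stop ends the measurement.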

The /sys/kernel/tracing/hwlat_detector interface contains the following files:

  - width - time period to sample with CPUs held (usecs)
            must be less than the total window size (enforced)
  - window - total period of sampling, width being inside (usecs)
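Because width must stay below window, changing both values needs some care
about ordering. The sketch below is one possible approach, assuming that each
write is checked against the current value of the other file; the text above
only says that the constraint is enforced. The function name hwlat_set_timing
and the TRACEFS variable are hypothetical.

```shell
# Assumed override for the tracefs mount point (hypothetical variable name).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# hwlat_set_timing WIDTH_US WINDOW_US
# Write both values in an order that keeps width < window after every write.
# Assumption: each write is validated against the other file's current value.
hwlat_set_timing() {
    # Refuse combinations that can never be valid.
    if [ "$1" -ge "$2" ]; then
        echo "width ($1) must be less than window ($2)" >&2
        return 1
    fi
    cur_window=$(cat "$TRACEFS/hwlat_detector/window") || return 1
    if [ "$1" -lt "$cur_window" ]; then
        # The new width fits inside the current window: set width first.
        echo "$1" > "$TRACEFS/hwlat_detector/width" || return 1
        echo "$2" > "$TRACEFS/hwlat_detector/window" || return 1
    else
        # The new width needs a larger window: grow the window first.
        echo "$2" > "$TRACEFS/hwlat_detector/window" || return 1
        echo "$1" > "$TRACEFS/hwlat_detector/width" || return 1
    fi
}
```

For example, hwlat_set_timing 2000000 4000000 would request a 2 second spin
inside a 4 second window.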

By default the width is set to 500,000 and window to 1,000,000, meaning that
for every 1,000,000 usecs (1s) the hwlat detector will spin for 500,000 usecs
(0.5s). If tracing_thresh contains zero when the hwlat tracer is enabled, it
will change to a default of 10 usecs. If any latency that exceeds the threshold
is observed, the data will be written to the tracing ring buffer.
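The share of each window spent spinning follows directly from the two values.
As a small illustration (the function name hwlat_duty_cycle is hypothetical,
and the result is truncated to a whole percentage):

```shell
# hwlat_duty_cycle WIDTH_US WINDOW_US
# Print the percentage of each window that the detector spends spinning.
# Integer arithmetic only, so the result is rounded down.
hwlat_duty_cycle() {
    if [ "$2" -le 0 ]; then
        echo "window must be positive" >&2
        return 1
    fi
    echo $(( $1 * 100 / $2 ))
}
```

With the defaults, hwlat_duty_cycle 500000 1000000 prints 50, that is, the
sampled CPU is held for half of every window.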

The minimum sleep time between periods is 1 millisecond. This holds even if
the difference between window and width is less than 1 millisecond, so that
the system is not totally starved.
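The rule can be sketched as simple arithmetic. This assumes the sleep time is
the difference between window and width with a floor of 1000 usecs; the exact
rounding used by the kernel may differ, and the function name hwlat_sleep_us
is hypothetical.

```shell
# hwlat_sleep_us WIDTH_US WINDOW_US
# Print the sleep time between sampling periods in microseconds.
# Assumption: sleep = window - width, but never less than 1000 us (1 ms).
hwlat_sleep_us() {
    gap=$(( $2 - $1 ))
    if [ "$gap" -lt 1000 ]; then
        gap=1000
    fi
    echo "$gap"
}
```

For example, hwlat_sleep_us 999500 1000000 prints 1000 rather than 500,
because the 1 millisecond minimum applies.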

If tracing_thresh was zero when the hwlat detector was started, it will be set
back to zero if another tracer is loaded. Note that the last value in
tracing_thresh that the hwlat detector had will be saved, and this value will
be restored in tracing_thresh if it is still zero when the hwlat detector is
started again.

The following tracing directory files are used by the hwlat_detector:

in /sys/kernel/tracing:

 - tracing_thresh	- minimum latency value to be considered (usecs)
 - tracing_max_latency	- maximum hardware latency actually observed (usecs)
 - tracing_cpumask	- the CPUs to move the hwlat thread across
 - hwlat_detector/width	- specified amount of time to spin within window (usecs)
 - hwlat_detector/window	- amount of time between (width) runs (usecs)
 - hwlat_detector/mode	- the thread mode
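A quick check after a run is to compare the observed maximum with the
threshold. The helper below is an illustrative sketch; the name hwlat_check,
the TRACEFS variable, the output format, and the return codes are all
hypothetical choices.

```shell
# Assumed override for the tracefs mount point (hypothetical variable name).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# hwlat_check
# Print the maximum observed latency and the threshold, both in usecs.
# Returns 0 if the maximum is within the threshold, 1 if it exceeds it,
# and 2 if either file cannot be read.
hwlat_check() {
    thresh=$(cat "$TRACEFS/tracing_thresh") || return 2
    max=$(cat "$TRACEFS/tracing_max_latency") || return 2
    echo "max=${max}us thresh=${thresh}us"
    [ "$max" -le "$thresh" ]
}
```

The return status makes the helper usable in scripts, for instance to flag a
platform whose firmware latencies exceed the chosen threshold.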

By default, a single hwlat detector kernel thread migrates across the CPUs
specified in tracing_cpumask at the beginning of each new window, in a
round-robin fashion. This behavior can be changed by changing the thread mode.
The available options are:

 - none:        do not force migration
 - round-robin: migrate across each CPU specified in tracing_cpumask [default]
 - per-cpu:     create one thread for each CPU in tracing_cpumask
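
Selecting a mode amounts to writing one of these names to the
hwlat_detector/mode file. A minimal sketch that rejects anything other than
the three options listed above (the function name hwlat_set_mode and the
TRACEFS variable are hypothetical):

```shell
# Assumed override for the tracefs mount point (hypothetical variable name).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# hwlat_set_mode MODE
# Write MODE to hwlat_detector/mode after checking it against the
# three documented options.
hwlat_set_mode() {
    case "$1" in
    none|round-robin|per-cpu)
        echo "$1" > "$TRACEFS/hwlat_detector/mode"
        ;;
    *)
        echo "unknown hwlat mode: $1" >&2
        return 1
        ;;
    esac
}
```

For example, hwlat_set_mode per-cpu requests one thread for each CPU in
tracing_cpumask.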