xref: /openbmc/linux/Documentation/trace/fprobetrace.rst (revision 590e7b2804152eeb4f9c2d7d8f5c0f5ea47cca3d)
1*590e7b28SMasami Hiramatsu (Google).. SPDX-License-Identifier: GPL-2.0
2*590e7b28SMasami Hiramatsu (Google)
3*590e7b28SMasami Hiramatsu (Google)==========================
4*590e7b28SMasami Hiramatsu (Google)Fprobe-based Event Tracing
5*590e7b28SMasami Hiramatsu (Google)==========================
6*590e7b28SMasami Hiramatsu (Google)
7*590e7b28SMasami Hiramatsu (Google).. Author: Masami Hiramatsu <mhiramat@kernel.org>
8*590e7b28SMasami Hiramatsu (Google)
9*590e7b28SMasami Hiramatsu (Google)Overview
10*590e7b28SMasami Hiramatsu (Google)--------
11*590e7b28SMasami Hiramatsu (Google)
12*590e7b28SMasami Hiramatsu (Google)Fprobe event is similar to the kprobe event, but limited to probe on
13*590e7b28SMasami Hiramatsu (Google)the function entry and exit only. It is good enough for many use cases
14*590e7b28SMasami Hiramatsu (Google)which only traces some specific functions.
15*590e7b28SMasami Hiramatsu (Google)
16*590e7b28SMasami Hiramatsu (Google)This document also covers tracepoint probe events (tprobe) since this
17*590e7b28SMasami Hiramatsu (Google)is also works only on the tracepoint entry. User can trace a part of
18*590e7b28SMasami Hiramatsu (Google)tracepoint argument, or the tracepoint without trace-event, which is
19*590e7b28SMasami Hiramatsu (Google)not exposed on tracefs.
20*590e7b28SMasami Hiramatsu (Google)
21*590e7b28SMasami Hiramatsu (Google)As same as other dynamic events, fprobe events and tracepoint probe
22*590e7b28SMasami Hiramatsu (Google)events are defined via `dynamic_events` interface file on tracefs.
23*590e7b28SMasami Hiramatsu (Google)
24*590e7b28SMasami Hiramatsu (Google)Synopsis of fprobe-events
25*590e7b28SMasami Hiramatsu (Google)-------------------------
26*590e7b28SMasami Hiramatsu (Google)::
27*590e7b28SMasami Hiramatsu (Google)
28*590e7b28SMasami Hiramatsu (Google)  f[:[GRP1/][EVENT1]] SYM [FETCHARGS]                       : Probe on function entry
29*590e7b28SMasami Hiramatsu (Google)  f[MAXACTIVE][:[GRP1/][EVENT1]] SYM%return [FETCHARGS]     : Probe on function exit
30*590e7b28SMasami Hiramatsu (Google)  t[:[GRP2/][EVENT2]] TRACEPOINT [FETCHARGS]                : Probe on tracepoint
31*590e7b28SMasami Hiramatsu (Google)
32*590e7b28SMasami Hiramatsu (Google) GRP1           : Group name for fprobe. If omitted, use "fprobes" for it.
33*590e7b28SMasami Hiramatsu (Google) GRP2           : Group name for tprobe. If omitted, use "tracepoints" for it.
34*590e7b28SMasami Hiramatsu (Google) EVENT1         : Event name for fprobe. If omitted, the event name is
35*590e7b28SMasami Hiramatsu (Google)                  "SYM__entry" or "SYM__exit".
36*590e7b28SMasami Hiramatsu (Google) EVENT2         : Event name for tprobe. If omitted, the event name is
37*590e7b28SMasami Hiramatsu (Google)                  the same as "TRACEPOINT", but if the "TRACEPOINT" starts
38*590e7b28SMasami Hiramatsu (Google)                  with a digit character, "_TRACEPOINT" is used.
39*590e7b28SMasami Hiramatsu (Google) MAXACTIVE      : Maximum number of instances of the specified function that
40*590e7b28SMasami Hiramatsu (Google)                  can be probed simultaneously, or 0 for the default value
41*590e7b28SMasami Hiramatsu (Google)                  as defined in Documentation/trace/fprobes.rst
42*590e7b28SMasami Hiramatsu (Google)
43*590e7b28SMasami Hiramatsu (Google) FETCHARGS      : Arguments. Each probe can have up to 128 args.
44*590e7b28SMasami Hiramatsu (Google)  ARG           : Fetch "ARG" function argument using BTF (only for function
45*590e7b28SMasami Hiramatsu (Google)                  entry or tracepoint.) (\*1)
46*590e7b28SMasami Hiramatsu (Google)  @ADDR         : Fetch memory at ADDR (ADDR should be in kernel)
47*590e7b28SMasami Hiramatsu (Google)  @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
48*590e7b28SMasami Hiramatsu (Google)  $stackN       : Fetch Nth entry of stack (N >= 0)
49*590e7b28SMasami Hiramatsu (Google)  $stack        : Fetch stack address.
50*590e7b28SMasami Hiramatsu (Google)  $argN         : Fetch the Nth function argument. (N >= 1) (\*2)
51*590e7b28SMasami Hiramatsu (Google)  $retval       : Fetch return value.(\*3)
52*590e7b28SMasami Hiramatsu (Google)  $comm         : Fetch current task comm.
53*590e7b28SMasami Hiramatsu (Google)  +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
54*590e7b28SMasami Hiramatsu (Google)  \IMM          : Store an immediate value to the argument.
55*590e7b28SMasami Hiramatsu (Google)  NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
56*590e7b28SMasami Hiramatsu (Google)  FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
57*590e7b28SMasami Hiramatsu (Google)                  (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
58*590e7b28SMasami Hiramatsu (Google)                  (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
59*590e7b28SMasami Hiramatsu (Google)                  and bitfield are supported.
60*590e7b28SMasami Hiramatsu (Google)
61*590e7b28SMasami Hiramatsu (Google)  (\*1) This is available only when BTF is enabled.
62*590e7b28SMasami Hiramatsu (Google)  (\*2) only for the probe on function entry (offs == 0).
63*590e7b28SMasami Hiramatsu (Google)  (\*3) only for return probe.
64*590e7b28SMasami Hiramatsu (Google)  (\*4) this is useful for fetching a field of data structures.
65*590e7b28SMasami Hiramatsu (Google)  (\*5) "u" means user-space dereference.
66*590e7b28SMasami Hiramatsu (Google)
67*590e7b28SMasami Hiramatsu (Google)For the details of TYPE, see :ref:`kprobetrace documentation <kprobetrace_types>`.
68*590e7b28SMasami Hiramatsu (Google)
69*590e7b28SMasami Hiramatsu (Google)BTF arguments
70*590e7b28SMasami Hiramatsu (Google)-------------
71*590e7b28SMasami Hiramatsu (Google)BTF (BPF Type Format) argument allows user to trace function and tracepoint
72*590e7b28SMasami Hiramatsu (Google)parameters by its name instead of ``$argN``. This feature is available if the
73*590e7b28SMasami Hiramatsu (Google)kernel is configured with CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF.
74*590e7b28SMasami Hiramatsu (Google)If user only specify the BTF argument, the event's argument name is also
75*590e7b28SMasami Hiramatsu (Google)automatically set by the given name. ::
76*590e7b28SMasami Hiramatsu (Google)
77*590e7b28SMasami Hiramatsu (Google) # echo 'f:myprobe vfs_read count pos' >> dynamic_events
78*590e7b28SMasami Hiramatsu (Google) # cat dynamic_events
79*590e7b28SMasami Hiramatsu (Google) f:fprobes/myprobe vfs_read count=count pos=pos
80*590e7b28SMasami Hiramatsu (Google)
81*590e7b28SMasami Hiramatsu (Google)It also chooses the fetch type from BTF information. For example, in the above
82*590e7b28SMasami Hiramatsu (Google)example, the ``count`` is unsigned long, and the ``pos`` is a pointer. Thus, both
83*590e7b28SMasami Hiramatsu (Google)are converted to 64bit unsigned long, but only ``pos`` has "%Lx" print-format as
84*590e7b28SMasami Hiramatsu (Google)below ::
85*590e7b28SMasami Hiramatsu (Google)
86*590e7b28SMasami Hiramatsu (Google) # cat events/fprobes/myprobe/format
87*590e7b28SMasami Hiramatsu (Google) name: myprobe
88*590e7b28SMasami Hiramatsu (Google) ID: 1313
89*590e7b28SMasami Hiramatsu (Google) format:
90*590e7b28SMasami Hiramatsu (Google)	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
91*590e7b28SMasami Hiramatsu (Google)	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
92*590e7b28SMasami Hiramatsu (Google)	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
93*590e7b28SMasami Hiramatsu (Google)	field:int common_pid;	offset:4;	size:4;	signed:1;
94*590e7b28SMasami Hiramatsu (Google)
95*590e7b28SMasami Hiramatsu (Google)	field:unsigned long __probe_ip;	offset:8;	size:8;	signed:0;
96*590e7b28SMasami Hiramatsu (Google)	field:u64 count;	offset:16;	size:8;	signed:0;
97*590e7b28SMasami Hiramatsu (Google)	field:u64 pos;	offset:24;	size:8;	signed:0;
98*590e7b28SMasami Hiramatsu (Google)
99*590e7b28SMasami Hiramatsu (Google) print fmt: "(%lx) count=%Lu pos=0x%Lx", REC->__probe_ip, REC->count, REC->pos
100*590e7b28SMasami Hiramatsu (Google)
101*590e7b28SMasami Hiramatsu (Google)If user unsures the name of arguments, ``$arg*`` will be helpful. The ``$arg*``
102*590e7b28SMasami Hiramatsu (Google)is expanded to all function arguments of the function or the tracepoint. ::
103*590e7b28SMasami Hiramatsu (Google)
104*590e7b28SMasami Hiramatsu (Google) # echo 'f:myprobe vfs_read $arg*' >> dynamic_events
105*590e7b28SMasami Hiramatsu (Google) # cat dynamic_events
106*590e7b28SMasami Hiramatsu (Google) f:fprobes/myprobe vfs_read file=file buf=buf count=count pos=pos
107*590e7b28SMasami Hiramatsu (Google)
108*590e7b28SMasami Hiramatsu (Google)BTF also affects the ``$retval``. If user doesn't set any type, the retval type is
109*590e7b28SMasami Hiramatsu (Google)automatically picked from the BTF. If the function returns ``void``, ``$retval``
110*590e7b28SMasami Hiramatsu (Google)is rejected.
111*590e7b28SMasami Hiramatsu (Google)
112*590e7b28SMasami Hiramatsu (Google)Usage examples
113*590e7b28SMasami Hiramatsu (Google)--------------
114*590e7b28SMasami Hiramatsu (Google)Here is an example to add fprobe events on ``vfs_read()`` function entry
115*590e7b28SMasami Hiramatsu (Google)and exit, with BTF arguments.
116*590e7b28SMasami Hiramatsu (Google)::
117*590e7b28SMasami Hiramatsu (Google)
118*590e7b28SMasami Hiramatsu (Google)  # echo 'f vfs_read $arg*' >> dynamic_events
119*590e7b28SMasami Hiramatsu (Google)  # echo 'f vfs_read%return $retval' >> dynamic_events
120*590e7b28SMasami Hiramatsu (Google)  # cat dynamic_events
121*590e7b28SMasami Hiramatsu (Google) f:fprobes/vfs_read__entry vfs_read file=file buf=buf count=count pos=pos
122*590e7b28SMasami Hiramatsu (Google) f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
123*590e7b28SMasami Hiramatsu (Google)  # echo 1 > events/fprobes/enable
124*590e7b28SMasami Hiramatsu (Google)  # head -n 20 trace | tail
125*590e7b28SMasami Hiramatsu (Google) #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
126*590e7b28SMasami Hiramatsu (Google) #              | |         |   |||||     |         |
127*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] ...1.   335.883195: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
128*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] .....   335.883208: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
129*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] ...1.   335.883220: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
130*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] .....   335.883224: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
131*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] ...1.   335.883232: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c687a count=1 pos=0xffffc900005aff08
132*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] .....   335.883237: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
133*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] ...1.   336.050329: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
134*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] .....   336.050343: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
135*590e7b28SMasami Hiramatsu (Google)
136*590e7b28SMasami Hiramatsu (Google)You can see all function arguments and return values are recorded as signed int.
137*590e7b28SMasami Hiramatsu (Google)
138*590e7b28SMasami Hiramatsu (Google)Also, here is an example of tracepoint events on ``sched_switch`` tracepoint.
139*590e7b28SMasami Hiramatsu (Google)To compare the result, this also enables the ``sched_switch`` traceevent too.
140*590e7b28SMasami Hiramatsu (Google)::
141*590e7b28SMasami Hiramatsu (Google)
142*590e7b28SMasami Hiramatsu (Google)  # echo 't sched_switch $arg*' >> dynamic_events
143*590e7b28SMasami Hiramatsu (Google)  # echo 1 > events/sched/sched_switch/enable
144*590e7b28SMasami Hiramatsu (Google)  # echo 1 > events/tracepoints/sched_switch/enable
145*590e7b28SMasami Hiramatsu (Google)  # echo > trace
146*590e7b28SMasami Hiramatsu (Google)  # head -n 20 trace | tail
147*590e7b28SMasami Hiramatsu (Google) #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
148*590e7b28SMasami Hiramatsu (Google) #              | |         |   |||||     |         |
149*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] d..2.  3912.083993: sched_switch: prev_comm=sh prev_pid=70 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
150*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] d..3.  3912.083995: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff88800664e100 next=0xffffffff828229c0 prev_state=1
151*590e7b28SMasami Hiramatsu (Google)           <idle>-0       [000] d..2.  3912.084183: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
152*590e7b28SMasami Hiramatsu (Google)           <idle>-0       [000] d..3.  3912.084184: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
153*590e7b28SMasami Hiramatsu (Google)      rcu_preempt-16      [000] d..2.  3912.084196: sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120
154*590e7b28SMasami Hiramatsu (Google)      rcu_preempt-16      [000] d..3.  3912.084196: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff888004208000 next=0xffffffff828229c0 prev_state=1026
155*590e7b28SMasami Hiramatsu (Google)           <idle>-0       [000] d..2.  3912.085191: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
156*590e7b28SMasami Hiramatsu (Google)           <idle>-0       [000] d..3.  3912.085191: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
157*590e7b28SMasami Hiramatsu (Google)
158*590e7b28SMasami Hiramatsu (Google)As you can see, the ``sched_switch`` trace-event shows *cooked* parameters, on
159*590e7b28SMasami Hiramatsu (Google)the other hand, the ``sched_switch`` tracepoint probe event shows *raw*
160*590e7b28SMasami Hiramatsu (Google)parameters. This means you can access any field values in the task
161*590e7b28SMasami Hiramatsu (Google)structure pointed by the ``prev`` and ``next`` arguments.
162*590e7b28SMasami Hiramatsu (Google)
163*590e7b28SMasami Hiramatsu (Google)For example, usually ``task_struct::start_time`` is not traced, but with this
164*590e7b28SMasami Hiramatsu (Google)traceprobe event, you can trace it as below.
165*590e7b28SMasami Hiramatsu (Google)::
166*590e7b28SMasami Hiramatsu (Google)
167*590e7b28SMasami Hiramatsu (Google)  # echo 't sched_switch comm=+1896(next):string start_time=+1728(next):u64' > dynamic_events
168*590e7b28SMasami Hiramatsu (Google)  # head -n 20 trace | tail
169*590e7b28SMasami Hiramatsu (Google) #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
170*590e7b28SMasami Hiramatsu (Google) #              | |         |   |||||     |         |
171*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] d..3.  5606.686577: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
172*590e7b28SMasami Hiramatsu (Google)      rcu_preempt-16      [000] d..3.  5606.686602: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="sh" usage=1 start_time=1596095526
173*590e7b28SMasami Hiramatsu (Google)               sh-70      [000] d..3.  5606.686637: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
174*590e7b28SMasami Hiramatsu (Google)           <idle>-0       [000] d..3.  5606.687190: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
175*590e7b28SMasami Hiramatsu (Google)      rcu_preempt-16      [000] d..3.  5606.687202: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
176*590e7b28SMasami Hiramatsu (Google)           <idle>-0       [000] d..3.  5606.690317: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
177*590e7b28SMasami Hiramatsu (Google)      kworker/0:1-14      [000] d..3.  5606.690339: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
178*590e7b28SMasami Hiramatsu (Google)           <idle>-0       [000] d..3.  5606.692368: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
179*590e7b28SMasami Hiramatsu (Google)
180*590e7b28SMasami Hiramatsu (Google)Currently, to find the offset of a specific field in the data structure,
181*590e7b28SMasami Hiramatsu (Google)you need to build kernel with debuginfo and run `perf probe` command with
182*590e7b28SMasami Hiramatsu (Google)`-D` option. e.g.
183*590e7b28SMasami Hiramatsu (Google)::
184*590e7b28SMasami Hiramatsu (Google)
185*590e7b28SMasami Hiramatsu (Google) # perf probe -D "__probestub_sched_switch next->comm:string next->start_time"
186*590e7b28SMasami Hiramatsu (Google) p:probe/__probestub_sched_switch __probestub_sched_switch+0 comm=+1896(%cx):string start_time=+1728(%cx):u64
187*590e7b28SMasami Hiramatsu (Google)
188*590e7b28SMasami Hiramatsu (Google)And replace the ``%cx`` with the ``next``.
189