1*590e7b28SMasami Hiramatsu (Google).. SPDX-License-Identifier: GPL-2.0 2*590e7b28SMasami Hiramatsu (Google) 3*590e7b28SMasami Hiramatsu (Google)========================== 4*590e7b28SMasami Hiramatsu (Google)Fprobe-based Event Tracing 5*590e7b28SMasami Hiramatsu (Google)========================== 6*590e7b28SMasami Hiramatsu (Google) 7*590e7b28SMasami Hiramatsu (Google).. Author: Masami Hiramatsu <mhiramat@kernel.org> 8*590e7b28SMasami Hiramatsu (Google) 9*590e7b28SMasami Hiramatsu (Google)Overview 10*590e7b28SMasami Hiramatsu (Google)-------- 11*590e7b28SMasami Hiramatsu (Google) 12*590e7b28SMasami Hiramatsu (Google)Fprobe event is similar to the kprobe event, but limited to probe on 13*590e7b28SMasami Hiramatsu (Google)the function entry and exit only. It is good enough for many use cases 14*590e7b28SMasami Hiramatsu (Google)which only traces some specific functions. 15*590e7b28SMasami Hiramatsu (Google) 16*590e7b28SMasami Hiramatsu (Google)This document also covers tracepoint probe events (tprobe) since this 17*590e7b28SMasami Hiramatsu (Google)is also works only on the tracepoint entry. User can trace a part of 18*590e7b28SMasami Hiramatsu (Google)tracepoint argument, or the tracepoint without trace-event, which is 19*590e7b28SMasami Hiramatsu (Google)not exposed on tracefs. 20*590e7b28SMasami Hiramatsu (Google) 21*590e7b28SMasami Hiramatsu (Google)As same as other dynamic events, fprobe events and tracepoint probe 22*590e7b28SMasami Hiramatsu (Google)events are defined via `dynamic_events` interface file on tracefs. 23*590e7b28SMasami Hiramatsu (Google) 24*590e7b28SMasami Hiramatsu (Google)Synopsis of fprobe-events 25*590e7b28SMasami Hiramatsu (Google)------------------------- 26*590e7b28SMasami Hiramatsu (Google):: 27*590e7b28SMasami Hiramatsu (Google) 28*590e7b28SMasami Hiramatsu (Google) f[:[GRP1/][EVENT1]] SYM [FETCHARGS] : Probe on function entry 29*590e7b28SMasami Hiramatsu (Google) f[MAXACTIVE][:[GRP1/][EVENT1]] SYM%return [FETCHARGS] : Probe on function exit 30*590e7b28SMasami Hiramatsu (Google) t[:[GRP2/][EVENT2]] TRACEPOINT [FETCHARGS] : Probe on tracepoint 31*590e7b28SMasami Hiramatsu (Google) 32*590e7b28SMasami Hiramatsu (Google) GRP1 : Group name for fprobe. If omitted, use "fprobes" for it. 33*590e7b28SMasami Hiramatsu (Google) GRP2 : Group name for tprobe. If omitted, use "tracepoints" for it. 34*590e7b28SMasami Hiramatsu (Google) EVENT1 : Event name for fprobe. If omitted, the event name is 35*590e7b28SMasami Hiramatsu (Google) "SYM__entry" or "SYM__exit". 36*590e7b28SMasami Hiramatsu (Google) EVENT2 : Event name for tprobe. If omitted, the event name is 37*590e7b28SMasami Hiramatsu (Google) the same as "TRACEPOINT", but if the "TRACEPOINT" starts 38*590e7b28SMasami Hiramatsu (Google) with a digit character, "_TRACEPOINT" is used. 39*590e7b28SMasami Hiramatsu (Google) MAXACTIVE : Maximum number of instances of the specified function that 40*590e7b28SMasami Hiramatsu (Google) can be probed simultaneously, or 0 for the default value 41*590e7b28SMasami Hiramatsu (Google) as defined in Documentation/trace/fprobes.rst 42*590e7b28SMasami Hiramatsu (Google) 43*590e7b28SMasami Hiramatsu (Google) FETCHARGS : Arguments. Each probe can have up to 128 args. 44*590e7b28SMasami Hiramatsu (Google) ARG : Fetch "ARG" function argument using BTF (only for function 45*590e7b28SMasami Hiramatsu (Google) entry or tracepoint.) (\*1) 46*590e7b28SMasami Hiramatsu (Google) @ADDR : Fetch memory at ADDR (ADDR should be in kernel) 47*590e7b28SMasami Hiramatsu (Google) @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) 48*590e7b28SMasami Hiramatsu (Google) $stackN : Fetch Nth entry of stack (N >= 0) 49*590e7b28SMasami Hiramatsu (Google) $stack : Fetch stack address. 50*590e7b28SMasami Hiramatsu (Google) $argN : Fetch the Nth function argument. (N >= 1) (\*2) 51*590e7b28SMasami Hiramatsu (Google) $retval : Fetch return value.(\*3) 52*590e7b28SMasami Hiramatsu (Google) $comm : Fetch current task comm. 53*590e7b28SMasami Hiramatsu (Google) +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5) 54*590e7b28SMasami Hiramatsu (Google) \IMM : Store an immediate value to the argument. 55*590e7b28SMasami Hiramatsu (Google) NAME=FETCHARG : Set NAME as the argument name of FETCHARG. 56*590e7b28SMasami Hiramatsu (Google) FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types 57*590e7b28SMasami Hiramatsu (Google) (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types 58*590e7b28SMasami Hiramatsu (Google) (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr" 59*590e7b28SMasami Hiramatsu (Google) and bitfield are supported. 60*590e7b28SMasami Hiramatsu (Google) 61*590e7b28SMasami Hiramatsu (Google) (\*1) This is available only when BTF is enabled. 62*590e7b28SMasami Hiramatsu (Google) (\*2) only for the probe on function entry (offs == 0). 63*590e7b28SMasami Hiramatsu (Google) (\*3) only for return probe. 64*590e7b28SMasami Hiramatsu (Google) (\*4) this is useful for fetching a field of data structures. 65*590e7b28SMasami Hiramatsu (Google) (\*5) "u" means user-space dereference. 66*590e7b28SMasami Hiramatsu (Google) 67*590e7b28SMasami Hiramatsu (Google)For the details of TYPE, see :ref:`kprobetrace documentation <kprobetrace_types>`. 68*590e7b28SMasami Hiramatsu (Google) 69*590e7b28SMasami Hiramatsu (Google)BTF arguments 70*590e7b28SMasami Hiramatsu (Google)------------- 71*590e7b28SMasami Hiramatsu (Google)BTF (BPF Type Format) argument allows user to trace function and tracepoint 72*590e7b28SMasami Hiramatsu (Google)parameters by its name instead of ``$argN``. This feature is available if the 73*590e7b28SMasami Hiramatsu (Google)kernel is configured with CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF. 74*590e7b28SMasami Hiramatsu (Google)If user only specify the BTF argument, the event's argument name is also 75*590e7b28SMasami Hiramatsu (Google)automatically set by the given name. :: 76*590e7b28SMasami Hiramatsu (Google) 77*590e7b28SMasami Hiramatsu (Google) # echo 'f:myprobe vfs_read count pos' >> dynamic_events 78*590e7b28SMasami Hiramatsu (Google) # cat dynamic_events 79*590e7b28SMasami Hiramatsu (Google) f:fprobes/myprobe vfs_read count=count pos=pos 80*590e7b28SMasami Hiramatsu (Google) 81*590e7b28SMasami Hiramatsu (Google)It also chooses the fetch type from BTF information. For example, in the above 82*590e7b28SMasami Hiramatsu (Google)example, the ``count`` is unsigned long, and the ``pos`` is a pointer. Thus, both 83*590e7b28SMasami Hiramatsu (Google)are converted to 64bit unsigned long, but only ``pos`` has "%Lx" print-format as 84*590e7b28SMasami Hiramatsu (Google)below :: 85*590e7b28SMasami Hiramatsu (Google) 86*590e7b28SMasami Hiramatsu (Google) # cat events/fprobes/myprobe/format 87*590e7b28SMasami Hiramatsu (Google) name: myprobe 88*590e7b28SMasami Hiramatsu (Google) ID: 1313 89*590e7b28SMasami Hiramatsu (Google) format: 90*590e7b28SMasami Hiramatsu (Google) field:unsigned short common_type; offset:0; size:2; signed:0; 91*590e7b28SMasami Hiramatsu (Google) field:unsigned char common_flags; offset:2; size:1; signed:0; 92*590e7b28SMasami Hiramatsu (Google) field:unsigned char common_preempt_count; offset:3; size:1; signed:0; 93*590e7b28SMasami Hiramatsu (Google) field:int common_pid; offset:4; size:4; signed:1; 94*590e7b28SMasami Hiramatsu (Google) 95*590e7b28SMasami Hiramatsu (Google) field:unsigned long __probe_ip; offset:8; size:8; signed:0; 96*590e7b28SMasami Hiramatsu (Google) field:u64 count; offset:16; size:8; signed:0; 97*590e7b28SMasami Hiramatsu (Google) field:u64 pos; offset:24; size:8; signed:0; 98*590e7b28SMasami Hiramatsu (Google) 99*590e7b28SMasami Hiramatsu (Google) print fmt: "(%lx) count=%Lu pos=0x%Lx", REC->__probe_ip, REC->count, REC->pos 100*590e7b28SMasami Hiramatsu (Google) 101*590e7b28SMasami Hiramatsu (Google)If user unsures the name of arguments, ``$arg*`` will be helpful. The ``$arg*`` 102*590e7b28SMasami Hiramatsu (Google)is expanded to all function arguments of the function or the tracepoint. :: 103*590e7b28SMasami Hiramatsu (Google) 104*590e7b28SMasami Hiramatsu (Google) # echo 'f:myprobe vfs_read $arg*' >> dynamic_events 105*590e7b28SMasami Hiramatsu (Google) # cat dynamic_events 106*590e7b28SMasami Hiramatsu (Google) f:fprobes/myprobe vfs_read file=file buf=buf count=count pos=pos 107*590e7b28SMasami Hiramatsu (Google) 108*590e7b28SMasami Hiramatsu (Google)BTF also affects the ``$retval``. If user doesn't set any type, the retval type is 109*590e7b28SMasami Hiramatsu (Google)automatically picked from the BTF. If the function returns ``void``, ``$retval`` 110*590e7b28SMasami Hiramatsu (Google)is rejected. 111*590e7b28SMasami Hiramatsu (Google) 112*590e7b28SMasami Hiramatsu (Google)Usage examples 113*590e7b28SMasami Hiramatsu (Google)-------------- 114*590e7b28SMasami Hiramatsu (Google)Here is an example to add fprobe events on ``vfs_read()`` function entry 115*590e7b28SMasami Hiramatsu (Google)and exit, with BTF arguments. 116*590e7b28SMasami Hiramatsu (Google):: 117*590e7b28SMasami Hiramatsu (Google) 118*590e7b28SMasami Hiramatsu (Google) # echo 'f vfs_read $arg*' >> dynamic_events 119*590e7b28SMasami Hiramatsu (Google) # echo 'f vfs_read%return $retval' >> dynamic_events 120*590e7b28SMasami Hiramatsu (Google) # cat dynamic_events 121*590e7b28SMasami Hiramatsu (Google) f:fprobes/vfs_read__entry vfs_read file=file buf=buf count=count pos=pos 122*590e7b28SMasami Hiramatsu (Google) f:fprobes/vfs_read__exit vfs_read%return arg1=$retval 123*590e7b28SMasami Hiramatsu (Google) # echo 1 > events/fprobes/enable 124*590e7b28SMasami Hiramatsu (Google) # head -n 20 trace | tail 125*590e7b28SMasami Hiramatsu (Google) # TASK-PID CPU# ||||| TIMESTAMP FUNCTION 126*590e7b28SMasami Hiramatsu (Google) # | | | ||||| | | 127*590e7b28SMasami Hiramatsu (Google) sh-70 [000] ...1. 335.883195: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08 128*590e7b28SMasami Hiramatsu (Google) sh-70 [000] ..... 335.883208: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1 129*590e7b28SMasami Hiramatsu (Google) sh-70 [000] ...1. 335.883220: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08 130*590e7b28SMasami Hiramatsu (Google) sh-70 [000] ..... 335.883224: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1 131*590e7b28SMasami Hiramatsu (Google) sh-70 [000] ...1. 335.883232: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c687a count=1 pos=0xffffc900005aff08 132*590e7b28SMasami Hiramatsu (Google) sh-70 [000] ..... 335.883237: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1 133*590e7b28SMasami Hiramatsu (Google) sh-70 [000] ...1. 336.050329: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08 134*590e7b28SMasami Hiramatsu (Google) sh-70 [000] ..... 336.050343: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1 135*590e7b28SMasami Hiramatsu (Google) 136*590e7b28SMasami Hiramatsu (Google)You can see all function arguments and return values are recorded as signed int. 137*590e7b28SMasami Hiramatsu (Google) 138*590e7b28SMasami Hiramatsu (Google)Also, here is an example of tracepoint events on ``sched_switch`` tracepoint. 139*590e7b28SMasami Hiramatsu (Google)To compare the result, this also enables the ``sched_switch`` traceevent too. 140*590e7b28SMasami Hiramatsu (Google):: 141*590e7b28SMasami Hiramatsu (Google) 142*590e7b28SMasami Hiramatsu (Google) # echo 't sched_switch $arg*' >> dynamic_events 143*590e7b28SMasami Hiramatsu (Google) # echo 1 > events/sched/sched_switch/enable 144*590e7b28SMasami Hiramatsu (Google) # echo 1 > events/tracepoints/sched_switch/enable 145*590e7b28SMasami Hiramatsu (Google) # echo > trace 146*590e7b28SMasami Hiramatsu (Google) # head -n 20 trace | tail 147*590e7b28SMasami Hiramatsu (Google) # TASK-PID CPU# ||||| TIMESTAMP FUNCTION 148*590e7b28SMasami Hiramatsu (Google) # | | | ||||| | | 149*590e7b28SMasami Hiramatsu (Google) sh-70 [000] d..2. 3912.083993: sched_switch: prev_comm=sh prev_pid=70 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120 150*590e7b28SMasami Hiramatsu (Google) sh-70 [000] d..3. 3912.083995: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff88800664e100 next=0xffffffff828229c0 prev_state=1 151*590e7b28SMasami Hiramatsu (Google) <idle>-0 [000] d..2. 3912.084183: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120 152*590e7b28SMasami Hiramatsu (Google) <idle>-0 [000] d..3. 3912.084184: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0 153*590e7b28SMasami Hiramatsu (Google) rcu_preempt-16 [000] d..2. 3912.084196: sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120 154*590e7b28SMasami Hiramatsu (Google) rcu_preempt-16 [000] d..3. 3912.084196: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff888004208000 next=0xffffffff828229c0 prev_state=1026 155*590e7b28SMasami Hiramatsu (Google) <idle>-0 [000] d..2. 3912.085191: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120 156*590e7b28SMasami Hiramatsu (Google) <idle>-0 [000] d..3. 3912.085191: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0 157*590e7b28SMasami Hiramatsu (Google) 158*590e7b28SMasami Hiramatsu (Google)As you can see, the ``sched_switch`` trace-event shows *cooked* parameters, on 159*590e7b28SMasami Hiramatsu (Google)the other hand, the ``sched_switch`` tracepoint probe event shows *raw* 160*590e7b28SMasami Hiramatsu (Google)parameters. This means you can access any field values in the task 161*590e7b28SMasami Hiramatsu (Google)structure pointed by the ``prev`` and ``next`` arguments. 162*590e7b28SMasami Hiramatsu (Google) 163*590e7b28SMasami Hiramatsu (Google)For example, usually ``task_struct::start_time`` is not traced, but with this 164*590e7b28SMasami Hiramatsu (Google)traceprobe event, you can trace it as below. 165*590e7b28SMasami Hiramatsu (Google):: 166*590e7b28SMasami Hiramatsu (Google) 167*590e7b28SMasami Hiramatsu (Google) # echo 't sched_switch comm=+1896(next):string start_time=+1728(next):u64' > dynamic_events 168*590e7b28SMasami Hiramatsu (Google) # head -n 20 trace | tail 169*590e7b28SMasami Hiramatsu (Google) # TASK-PID CPU# ||||| TIMESTAMP FUNCTION 170*590e7b28SMasami Hiramatsu (Google) # | | | ||||| | | 171*590e7b28SMasami Hiramatsu (Google) sh-70 [000] d..3. 5606.686577: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000 172*590e7b28SMasami Hiramatsu (Google) rcu_preempt-16 [000] d..3. 5606.686602: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="sh" usage=1 start_time=1596095526 173*590e7b28SMasami Hiramatsu (Google) sh-70 [000] d..3. 5606.686637: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0 174*590e7b28SMasami Hiramatsu (Google) <idle>-0 [000] d..3. 5606.687190: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000 175*590e7b28SMasami Hiramatsu (Google) rcu_preempt-16 [000] d..3. 5606.687202: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0 176*590e7b28SMasami Hiramatsu (Google) <idle>-0 [000] d..3. 5606.690317: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000 177*590e7b28SMasami Hiramatsu (Google) kworker/0:1-14 [000] d..3. 5606.690339: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0 178*590e7b28SMasami Hiramatsu (Google) <idle>-0 [000] d..3. 5606.692368: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000 179*590e7b28SMasami Hiramatsu (Google) 180*590e7b28SMasami Hiramatsu (Google)Currently, to find the offset of a specific field in the data structure, 181*590e7b28SMasami Hiramatsu (Google)you need to build kernel with debuginfo and run `perf probe` command with 182*590e7b28SMasami Hiramatsu (Google)`-D` option. e.g. 183*590e7b28SMasami Hiramatsu (Google):: 184*590e7b28SMasami Hiramatsu (Google) 185*590e7b28SMasami Hiramatsu (Google) # perf probe -D "__probestub_sched_switch next->comm:string next->start_time" 186*590e7b28SMasami Hiramatsu (Google) p:probe/__probestub_sched_switch __probestub_sched_switch+0 comm=+1896(%cx):string start_time=+1728(%cx):u64 187*590e7b28SMasami Hiramatsu (Google) 188*590e7b28SMasami Hiramatsu (Google)And replace the ``%cx`` with the ``next``. 189