1=============
2Event Tracing
3=============
4
5:Author: Theodore Ts'o
6:Updated: Li Zefan and Tom Zanussi
7
81. Introduction
9===============
10
11Tracepoints (see Documentation/trace/tracepoints.rst) can be used
12without creating custom kernel modules to register probe functions
13using the event tracing infrastructure.
14
15Not all tracepoints can be traced using the event tracing system;
16the kernel developer must provide code snippets which define how the
17tracing information is saved into the tracing buffer, and how the
18tracing information should be printed.
19
202. Using Event Tracing
21======================
22
232.1 Via the 'set_event' interface
24---------------------------------
25
26The events which are available for tracing can be found in the file
27/sys/kernel/tracing/available_events.
28
29To enable a particular event, such as 'sched_wakeup', simply echo it
30to /sys/kernel/tracing/set_event. For example::
31
32	# echo sched_wakeup >> /sys/kernel/tracing/set_event
33
34.. Note:: '>>' is necessary, otherwise it will firstly disable all the events.
35
36To disable an event, echo the event name to the set_event file prefixed
37with an exclamation point::
38
39	# echo '!sched_wakeup' >> /sys/kernel/tracing/set_event
40
41To disable all events, echo an empty line to the set_event file::
42
43	# echo > /sys/kernel/tracing/set_event
44
45To enable all events, echo ``*:*`` or ``*:`` to the set_event file::
46
47	# echo *:* > /sys/kernel/tracing/set_event
48
49The events are organized into subsystems, such as ext4, irq, sched,
50etc., and a full event name looks like this: <subsystem>:<event>.  The
51subsystem name is optional, but it is displayed in the available_events
52file.  All of the events in a subsystem can be specified via the syntax
53``<subsystem>:*``; for example, to enable all irq events, you can use the
54command::
55
56	# echo 'irq:*' > /sys/kernel/tracing/set_event
57
582.2 Via the 'enable' toggle
59---------------------------
60
61The events available are also listed in /sys/kernel/tracing/events/ hierarchy
62of directories.
63
64To enable event 'sched_wakeup'::
65
66	# echo 1 > /sys/kernel/tracing/events/sched/sched_wakeup/enable
67
68To disable it::
69
70	# echo 0 > /sys/kernel/tracing/events/sched/sched_wakeup/enable
71
72To enable all events in sched subsystem::
73
74	# echo 1 > /sys/kernel/tracing/events/sched/enable
75
76To enable all events::
77
78	# echo 1 > /sys/kernel/tracing/events/enable
79
80When reading one of these enable files, there are four results:
81
82 - 0 - all events this file affects are disabled
83 - 1 - all events this file affects are enabled
84 - X - there is a mixture of events enabled and disabled
85 - ? - this file does not affect any event
86
872.3 Boot option
88---------------
89
90In order to facilitate early boot debugging, use boot option::
91
92	trace_event=[event-list]
93
94event-list is a comma separated list of events. See section 2.1 for event
95format.
96
973. Defining an event-enabled tracepoint
98=======================================
99
100See The example provided in samples/trace_events
101
1024. Event formats
103================
104
105Each trace event has a 'format' file associated with it that contains
106a description of each field in a logged event.  This information can
107be used to parse the binary trace stream, and is also the place to
108find the field names that can be used in event filters (see section 5).
109
110It also displays the format string that will be used to print the
111event in text mode, along with the event name and ID used for
112profiling.
113
114Every event has a set of ``common`` fields associated with it; these are
115the fields prefixed with ``common_``.  The other fields vary between
116events and correspond to the fields defined in the TRACE_EVENT
117definition for that event.
118
119Each field in the format has the form::
120
121     field:field-type field-name; offset:N; size:N;
122
123where offset is the offset of the field in the trace record and size
124is the size of the data item, in bytes.
125
126For example, here's the information displayed for the 'sched_wakeup'
127event::
128
129	# cat /sys/kernel/tracing/events/sched/sched_wakeup/format
130
131	name: sched_wakeup
132	ID: 60
133	format:
134		field:unsigned short common_type;	offset:0;	size:2;
135		field:unsigned char common_flags;	offset:2;	size:1;
136		field:unsigned char common_preempt_count;	offset:3;	size:1;
137		field:int common_pid;	offset:4;	size:4;
138		field:int common_tgid;	offset:8;	size:4;
139
140		field:char comm[TASK_COMM_LEN];	offset:12;	size:16;
141		field:pid_t pid;	offset:28;	size:4;
142		field:int prio;	offset:32;	size:4;
143		field:int success;	offset:36;	size:4;
144		field:int cpu;	offset:40;	size:4;
145
146	print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid,
147		   REC->prio, REC->success, REC->cpu
148
149This event contains 10 fields, the first 5 common and the remaining 5
150event-specific.  All the fields for this event are numeric, except for
151'comm' which is a string, a distinction important for event filtering.
152
1535. Event filtering
154==================
155
156Trace events can be filtered in the kernel by associating boolean
157'filter expressions' with them.  As soon as an event is logged into
158the trace buffer, its fields are checked against the filter expression
159associated with that event type.  An event with field values that
160'match' the filter will appear in the trace output, and an event whose
161values don't match will be discarded.  An event with no filter
162associated with it matches everything, and is the default when no
163filter has been set for an event.
164
1655.1 Expression syntax
166---------------------
167
168A filter expression consists of one or more 'predicates' that can be
169combined using the logical operators '&&' and '||'.  A predicate is
170simply a clause that compares the value of a field contained within a
171logged event with a constant value and returns either 0 or 1 depending
172on whether the field value matched (1) or didn't match (0)::
173
174	  field-name relational-operator value
175
176Parentheses can be used to provide arbitrary logical groupings and
177double-quotes can be used to prevent the shell from interpreting
178operators as shell metacharacters.
179
180The field-names available for use in filters can be found in the
181'format' files for trace events (see section 4).
182
183The relational-operators depend on the type of the field being tested:
184
185The operators available for numeric fields are:
186
187==, !=, <, <=, >, >=, &
188
189And for string fields they are:
190
191==, !=, ~
192
193The glob (~) accepts a wild card character (\*,?) and character classes
194([). For example::
195
196  prev_comm ~ "*sh"
197  prev_comm ~ "sh*"
198  prev_comm ~ "*sh*"
199  prev_comm ~ "ba*sh"
200
201If the field is a pointer that points into user space (for example
202"filename" from sys_enter_openat), then you have to append ".ustring" to the
203field name::
204
205  filename.ustring ~ "password"
206
207As the kernel will have to know how to retrieve the memory that the pointer
208is at from user space.
209
210You can convert any long type to a function address and search by function name::
211
212  call_site.function == security_prepare_creds
213
214The above will filter when the field "call_site" falls on the address within
215"security_prepare_creds". That is, it will compare the value of "call_site" and
216the filter will return true if it is greater than or equal to the start of
217the function "security_prepare_creds" and less than the end of that function.
218
219The ".function" postfix can only be attached to values of size long, and can only
220be compared with "==" or "!=".
221
2225.2 Setting filters
223-------------------
224
225A filter for an individual event is set by writing a filter expression
226to the 'filter' file for the given event.
227
228For example::
229
230	# cd /sys/kernel/tracing/events/sched/sched_wakeup
231	# echo "common_preempt_count > 4" > filter
232
233A slightly more involved example::
234
235	# cd /sys/kernel/tracing/events/signal/signal_generate
236	# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter
237
238If there is an error in the expression, you'll get an 'Invalid
239argument' error when setting it, and the erroneous string along with
240an error message can be seen by looking at the filter e.g.::
241
242	# cd /sys/kernel/tracing/events/signal/signal_generate
243	# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter
244	-bash: echo: write error: Invalid argument
245	# cat filter
246	((sig >= 10 && sig < 15) || dsig == 17) && comm != bash
247	^
248	parse_error: Field not found
249
250Currently the caret ('^') for an error always appears at the beginning of
251the filter string; the error message should still be useful though
252even without more accurate position info.
253
2545.2.1 Filter limitations
255------------------------
256
257If a filter is placed on a string pointer ``(char *)`` that does not point
258to a string on the ring buffer, but instead points to kernel or user space
259memory, then, for safety reasons, at most 1024 bytes of the content is
260copied onto a temporary buffer to do the compare. If the copy of the memory
261faults (the pointer points to memory that should not be accessed), then the
262string compare will be treated as not matching.
263
2645.3 Clearing filters
265--------------------
266
267To clear the filter for an event, write a '0' to the event's filter
268file.
269
270To clear the filters for all events in a subsystem, write a '0' to the
271subsystem's filter file.
272
2735.4 Subsystem filters
274---------------------
275
276For convenience, filters for every event in a subsystem can be set or
277cleared as a group by writing a filter expression into the filter file
278at the root of the subsystem.  Note however, that if a filter for any
279event within the subsystem lacks a field specified in the subsystem
280filter, or if the filter can't be applied for any other reason, the
281filter for that event will retain its previous setting.  This can
282result in an unintended mixture of filters which could lead to
283confusing (to the user who might think different filters are in
284effect) trace output.  Only filters that reference just the common
285fields can be guaranteed to propagate successfully to all events.
286
287Here are a few subsystem filter examples that also illustrate the
288above points:
289
290Clear the filters on all events in the sched subsystem::
291
292	# cd /sys/kernel/tracing/events/sched
293	# echo 0 > filter
294	# cat sched_switch/filter
295	none
296	# cat sched_wakeup/filter
297	none
298
299Set a filter using only common fields for all events in the sched
300subsystem (all events end up with the same filter)::
301
302	# cd /sys/kernel/tracing/events/sched
303	# echo common_pid == 0 > filter
304	# cat sched_switch/filter
305	common_pid == 0
306	# cat sched_wakeup/filter
307	common_pid == 0
308
309Attempt to set a filter using a non-common field for all events in the
310sched subsystem (all events but those that have a prev_pid field retain
311their old filters)::
312
313	# cd /sys/kernel/tracing/events/sched
314	# echo prev_pid == 0 > filter
315	# cat sched_switch/filter
316	prev_pid == 0
317	# cat sched_wakeup/filter
318	common_pid == 0
319
3205.5 PID filtering
321-----------------
322
323The set_event_pid file in the same directory as the top events directory
324exists, will filter all events from tracing any task that does not have the
325PID listed in the set_event_pid file.
326::
327
328	# cd /sys/kernel/tracing
329	# echo $$ > set_event_pid
330	# echo 1 > events/enable
331
332Will only trace events for the current task.
333
334To add more PIDs without losing the PIDs already included, use '>>'.
335::
336
337	# echo 123 244 1 >> set_event_pid
338
339
3406. Event triggers
341=================
342
343Trace events can be made to conditionally invoke trigger 'commands'
344which can take various forms and are described in detail below;
345examples would be enabling or disabling other trace events or invoking
346a stack trace whenever the trace event is hit.  Whenever a trace event
347with attached triggers is invoked, the set of trigger commands
348associated with that event is invoked.  Any given trigger can
349additionally have an event filter of the same form as described in
350section 5 (Event filtering) associated with it - the command will only
351be invoked if the event being invoked passes the associated filter.
352If no filter is associated with the trigger, it always passes.
353
354Triggers are added to and removed from a particular event by writing
355trigger expressions to the 'trigger' file for the given event.
356
357A given event can have any number of triggers associated with it,
358subject to any restrictions that individual commands may have in that
359regard.
360
361Event triggers are implemented on top of "soft" mode, which means that
362whenever a trace event has one or more triggers associated with it,
363the event is activated even if it isn't actually enabled, but is
364disabled in a "soft" mode.  That is, the tracepoint will be called,
365but just will not be traced, unless of course it's actually enabled.
366This scheme allows triggers to be invoked even for events that aren't
367enabled, and also allows the current event filter implementation to be
368used for conditionally invoking triggers.
369
370The syntax for event triggers is roughly based on the syntax for
371set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands'
372section of Documentation/trace/ftrace.rst), but there are major
373differences and the implementation isn't currently tied to it in any
374way, so beware about making generalizations between the two.
375
376.. Note::
377     Writing into trace_marker (See Documentation/trace/ftrace.rst)
378     can also enable triggers that are written into
379     /sys/kernel/tracing/events/ftrace/print/trigger
380
3816.1 Expression syntax
382---------------------
383
384Triggers are added by echoing the command to the 'trigger' file::
385
386  # echo 'command[:count] [if filter]' > trigger
387
388Triggers are removed by echoing the same command but starting with '!'
389to the 'trigger' file::
390
391  # echo '!command[:count] [if filter]' > trigger
392
393The [if filter] part isn't used in matching commands when removing, so
394leaving that off in a '!' command will accomplish the same thing as
395having it in.
396
397The filter syntax is the same as that described in the 'Event
398filtering' section above.
399
400For ease of use, writing to the trigger file using '>' currently just
401adds or removes a single trigger and there's no explicit '>>' support
402('>' actually behaves like '>>') or truncation support to remove all
403triggers (you have to use '!' for each one added.)
404
4056.2 Supported trigger commands
406------------------------------
407
408The following commands are supported:
409
410- enable_event/disable_event
411
412  These commands can enable or disable another trace event whenever
413  the triggering event is hit.  When these commands are registered,
414  the other trace event is activated, but disabled in a "soft" mode.
415  That is, the tracepoint will be called, but just will not be traced.
416  The event tracepoint stays in this mode as long as there's a trigger
417  in effect that can trigger it.
418
419  For example, the following trigger causes kmalloc events to be
420  traced when a read system call is entered, and the :1 at the end
421  specifies that this enablement happens only once::
422
423	  # echo 'enable_event:kmem:kmalloc:1' > \
424	      /sys/kernel/tracing/events/syscalls/sys_enter_read/trigger
425
426  The following trigger causes kmalloc events to stop being traced
427  when a read system call exits.  This disablement happens on every
428  read system call exit::
429
430	  # echo 'disable_event:kmem:kmalloc' > \
431	      /sys/kernel/tracing/events/syscalls/sys_exit_read/trigger
432
433  The format is::
434
435      enable_event:<system>:<event>[:count]
436      disable_event:<system>:<event>[:count]
437
438  To remove the above commands::
439
440	  # echo '!enable_event:kmem:kmalloc:1' > \
441	      /sys/kernel/tracing/events/syscalls/sys_enter_read/trigger
442
443	  # echo '!disable_event:kmem:kmalloc' > \
444	      /sys/kernel/tracing/events/syscalls/sys_exit_read/trigger
445
446  Note that there can be any number of enable/disable_event triggers
447  per triggering event, but there can only be one trigger per
448  triggered event. e.g. sys_enter_read can have triggers enabling both
449  kmem:kmalloc and sched:sched_switch, but can't have two kmem:kmalloc
450  versions such as kmem:kmalloc and kmem:kmalloc:1 or 'kmem:kmalloc if
451  bytes_req == 256' and 'kmem:kmalloc if bytes_alloc == 256' (they
452  could be combined into a single filter on kmem:kmalloc though).
453
454- stacktrace
455
456  This command dumps a stacktrace in the trace buffer whenever the
457  triggering event occurs.
458
459  For example, the following trigger dumps a stacktrace every time the
460  kmalloc tracepoint is hit::
461
462	  # echo 'stacktrace' > \
463		/sys/kernel/tracing/events/kmem/kmalloc/trigger
464
465  The following trigger dumps a stacktrace the first 5 times a kmalloc
466  request happens with a size >= 64K::
467
468	  # echo 'stacktrace:5 if bytes_req >= 65536' > \
469		/sys/kernel/tracing/events/kmem/kmalloc/trigger
470
471  The format is::
472
473      stacktrace[:count]
474
475  To remove the above commands::
476
477	  # echo '!stacktrace' > \
478		/sys/kernel/tracing/events/kmem/kmalloc/trigger
479
480	  # echo '!stacktrace:5 if bytes_req >= 65536' > \
481		/sys/kernel/tracing/events/kmem/kmalloc/trigger
482
483  The latter can also be removed more simply by the following (without
484  the filter)::
485
486	  # echo '!stacktrace:5' > \
487		/sys/kernel/tracing/events/kmem/kmalloc/trigger
488
489  Note that there can be only one stacktrace trigger per triggering
490  event.
491
492- snapshot
493
494  This command causes a snapshot to be triggered whenever the
495  triggering event occurs.
496
497  The following command creates a snapshot every time a block request
498  queue is unplugged with a depth > 1.  If you were tracing a set of
499  events or functions at the time, the snapshot trace buffer would
500  capture those events when the trigger event occurred::
501
502	  # echo 'snapshot if nr_rq > 1' > \
503		/sys/kernel/tracing/events/block/block_unplug/trigger
504
505  To only snapshot once::
506
507	  # echo 'snapshot:1 if nr_rq > 1' > \
508		/sys/kernel/tracing/events/block/block_unplug/trigger
509
510  To remove the above commands::
511
512	  # echo '!snapshot if nr_rq > 1' > \
513		/sys/kernel/tracing/events/block/block_unplug/trigger
514
515	  # echo '!snapshot:1 if nr_rq > 1' > \
516		/sys/kernel/tracing/events/block/block_unplug/trigger
517
518  Note that there can be only one snapshot trigger per triggering
519  event.
520
521- traceon/traceoff
522
523  These commands turn tracing on and off when the specified events are
524  hit. The parameter determines how many times the tracing system is
525  turned on and off. If unspecified, there is no limit.
526
527  The following command turns tracing off the first time a block
528  request queue is unplugged with a depth > 1.  If you were tracing a
529  set of events or functions at the time, you could then examine the
530  trace buffer to see the sequence of events that led up to the
531  trigger event::
532
533	  # echo 'traceoff:1 if nr_rq > 1' > \
534		/sys/kernel/tracing/events/block/block_unplug/trigger
535
536  To always disable tracing when nr_rq  > 1::
537
538	  # echo 'traceoff if nr_rq > 1' > \
539		/sys/kernel/tracing/events/block/block_unplug/trigger
540
541  To remove the above commands::
542
543	  # echo '!traceoff:1 if nr_rq > 1' > \
544		/sys/kernel/tracing/events/block/block_unplug/trigger
545
546	  # echo '!traceoff if nr_rq > 1' > \
547		/sys/kernel/tracing/events/block/block_unplug/trigger
548
549  Note that there can be only one traceon or traceoff trigger per
550  triggering event.
551
552- hist
553
554  This command aggregates event hits into a hash table keyed on one or
555  more trace event format fields (or stacktrace) and a set of running
556  totals derived from one or more trace event format fields and/or
557  event counts (hitcount).
558
559  See Documentation/trace/histogram.rst for details and examples.
560
5617. In-kernel trace event API
562============================
563
564In most cases, the command-line interface to trace events is more than
565sufficient.  Sometimes, however, applications might find the need for
566more complex relationships than can be expressed through a simple
567series of linked command-line expressions, or putting together sets of
568commands may be simply too cumbersome.  An example might be an
569application that needs to 'listen' to the trace stream in order to
570maintain an in-kernel state machine detecting, for instance, when an
571illegal kernel state occurs in the scheduler.
572
573The trace event subsystem provides an in-kernel API allowing modules
574or other kernel code to generate user-defined 'synthetic' events at
575will, which can be used to either augment the existing trace stream
576and/or signal that a particular important state has occurred.
577
578A similar in-kernel API is also available for creating kprobe and
579kretprobe events.
580
581Both the synthetic event and k/ret/probe event APIs are built on top
582of a lower-level "dynevent_cmd" event command API, which is also
583available for more specialized applications, or as the basis of other
584higher-level trace event APIs.
585
586The API provided for these purposes is describe below and allows the
587following:
588
589  - dynamically creating synthetic event definitions
590  - dynamically creating kprobe and kretprobe event definitions
591  - tracing synthetic events from in-kernel code
592  - the low-level "dynevent_cmd" API
593
5947.1 Dyamically creating synthetic event definitions
595---------------------------------------------------
596
597There are a couple ways to create a new synthetic event from a kernel
598module or other kernel code.
599
600The first creates the event in one step, using synth_event_create().
601In this method, the name of the event to create and an array defining
602the fields is supplied to synth_event_create().  If successful, a
603synthetic event with that name and fields will exist following that
604call.  For example, to create a new "schedtest" synthetic event::
605
606  ret = synth_event_create("schedtest", sched_fields,
607                           ARRAY_SIZE(sched_fields), THIS_MODULE);
608
609The sched_fields param in this example points to an array of struct
610synth_field_desc, each of which describes an event field by type and
611name::
612
613  static struct synth_field_desc sched_fields[] = {
614        { .type = "pid_t",              .name = "next_pid_field" },
615        { .type = "char[16]",           .name = "next_comm_field" },
616        { .type = "u64",                .name = "ts_ns" },
617        { .type = "u64",                .name = "ts_ms" },
618        { .type = "unsigned int",       .name = "cpu" },
619        { .type = "char[64]",           .name = "my_string_field" },
620        { .type = "int",                .name = "my_int_field" },
621  };
622
623See synth_field_size() for available types.
624
625If field_name contains [n], the field is considered to be a static array.
626
627If field_names contains[] (no subscript), the field is considered to
628be a dynamic array, which will only take as much space in the event as
629is required to hold the array.
630
631Because space for an event is reserved before assigning field values
632to the event, using dynamic arrays implies that the piecewise
633in-kernel API described below can't be used with dynamic arrays.  The
634other non-piecewise in-kernel APIs can, however, be used with dynamic
635arrays.
636
637If the event is created from within a module, a pointer to the module
638must be passed to synth_event_create().  This will ensure that the
639trace buffer won't contain unreadable events when the module is
640removed.
641
642At this point, the event object is ready to be used for generating new
643events.
644
645In the second method, the event is created in several steps.  This
646allows events to be created dynamically and without the need to create
647and populate an array of fields beforehand.
648
649To use this method, an empty or partially empty synthetic event should
650first be created using synth_event_gen_cmd_start() or
651synth_event_gen_cmd_array_start().  For synth_event_gen_cmd_start(),
652the name of the event along with one or more pairs of args each pair
653representing a 'type field_name;' field specification should be
654supplied.  For synth_event_gen_cmd_array_start(), the name of the
655event along with an array of struct synth_field_desc should be
656supplied. Before calling synth_event_gen_cmd_start() or
657synth_event_gen_cmd_array_start(), the user should create and
658initialize a dynevent_cmd object using synth_event_cmd_init().
659
660For example, to create a new "schedtest" synthetic event with two
661fields::
662
663  struct dynevent_cmd cmd;
664  char *buf;
665
666  /* Create a buffer to hold the generated command */
667  buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
668
669  /* Before generating the command, initialize the cmd object */
670  synth_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN);
671
672  ret = synth_event_gen_cmd_start(&cmd, "schedtest", THIS_MODULE,
673                                  "pid_t", "next_pid_field",
674                                  "u64", "ts_ns");
675
676Alternatively, using an array of struct synth_field_desc fields
677containing the same information::
678
679  ret = synth_event_gen_cmd_array_start(&cmd, "schedtest", THIS_MODULE,
680                                        fields, n_fields);
681
682Once the synthetic event object has been created, it can then be
683populated with more fields.  Fields are added one by one using
684synth_event_add_field(), supplying the dynevent_cmd object, a field
685type, and a field name.  For example, to add a new int field named
686"intfield", the following call should be made::
687
688  ret = synth_event_add_field(&cmd, "int", "intfield");
689
690See synth_field_size() for available types. If field_name contains [n]
691the field is considered to be an array.
692
693A group of fields can also be added all at once using an array of
694synth_field_desc with add_synth_fields().  For example, this would add
695just the first four sched_fields::
696
697  ret = synth_event_add_fields(&cmd, sched_fields, 4);
698
699If you already have a string of the form 'type field_name',
700synth_event_add_field_str() can be used to add it as-is; it will
701also automatically append a ';' to the string.
702
703Once all the fields have been added, the event should be finalized and
704registered by calling the synth_event_gen_cmd_end() function::
705
706  ret = synth_event_gen_cmd_end(&cmd);
707
708At this point, the event object is ready to be used for tracing new
709events.
710
7117.2 Tracing synthetic events from in-kernel code
712------------------------------------------------
713
714To trace a synthetic event, there are several options.  The first
715option is to trace the event in one call, using synth_event_trace()
716with a variable number of values, or synth_event_trace_array() with an
717array of values to be set.  A second option can be used to avoid the
718need for a pre-formed array of values or list of arguments, via
719synth_event_trace_start() and synth_event_trace_end() along with
720synth_event_add_next_val() or synth_event_add_val() to add the values
721piecewise.
722
7237.2.1 Tracing a synthetic event all at once
724-------------------------------------------
725
726To trace a synthetic event all at once, the synth_event_trace() or
727synth_event_trace_array() functions can be used.
728
729The synth_event_trace() function is passed the trace_event_file
730representing the synthetic event (which can be retrieved using
731trace_get_event_file() using the synthetic event name, "synthetic" as
732the system name, and the trace instance name (NULL if using the global
733trace array)), along with an variable number of u64 args, one for each
734synthetic event field, and the number of values being passed.
735
736So, to trace an event corresponding to the synthetic event definition
737above, code like the following could be used::
738
739  ret = synth_event_trace(create_synth_test, 7, /* number of values */
740                          444,             /* next_pid_field */
741                          (u64)"clackers", /* next_comm_field */
742                          1000000,         /* ts_ns */
743                          1000,            /* ts_ms */
744                          smp_processor_id(),/* cpu */
745                          (u64)"Thneed",   /* my_string_field */
746                          999);            /* my_int_field */
747
748All vals should be cast to u64, and string vals are just pointers to
749strings, cast to u64.  Strings will be copied into space reserved in
750the event for the string, using these pointers.
751
752Alternatively, the synth_event_trace_array() function can be used to
753accomplish the same thing.  It is passed the trace_event_file
754representing the synthetic event (which can be retrieved using
755trace_get_event_file() using the synthetic event name, "synthetic" as
756the system name, and the trace instance name (NULL if using the global
757trace array)), along with an array of u64, one for each synthetic
758event field.
759
760To trace an event corresponding to the synthetic event definition
761above, code like the following could be used::
762
763  u64 vals[7];
764
765  vals[0] = 777;                  /* next_pid_field */
766  vals[1] = (u64)"tiddlywinks";   /* next_comm_field */
767  vals[2] = 1000000;              /* ts_ns */
768  vals[3] = 1000;                 /* ts_ms */
769  vals[4] = smp_processor_id();   /* cpu */
770  vals[5] = (u64)"thneed";        /* my_string_field */
771  vals[6] = 398;                  /* my_int_field */
772
773The 'vals' array is just an array of u64, the number of which must
774match the number of field in the synthetic event, and which must be in
775the same order as the synthetic event fields.
776
777All vals should be cast to u64, and string vals are just pointers to
778strings, cast to u64.  Strings will be copied into space reserved in
779the event for the string, using these pointers.
780
781In order to trace a synthetic event, a pointer to the trace event file
782is needed.  The trace_get_event_file() function can be used to get
783it - it will find the file in the given trace instance (in this case
784NULL since the top trace array is being used) while at the same time
785preventing the instance containing it from going away::
786
787       schedtest_event_file = trace_get_event_file(NULL, "synthetic",
788                                                   "schedtest");
789
790Before tracing the event, it should be enabled in some way, otherwise
791the synthetic event won't actually show up in the trace buffer.
792
793To enable a synthetic event from the kernel, trace_array_set_clr_event()
794can be used (which is not specific to synthetic events, so does need
795the "synthetic" system name to be specified explicitly).
796
797To enable the event, pass 'true' to it::
798
799       trace_array_set_clr_event(schedtest_event_file->tr,
800                                 "synthetic", "schedtest", true);
801
802To disable it pass false::
803
804       trace_array_set_clr_event(schedtest_event_file->tr,
805                                 "synthetic", "schedtest", false);
806
807Finally, synth_event_trace_array() can be used to actually trace the
808event, which should be visible in the trace buffer afterwards::
809
810       ret = synth_event_trace_array(schedtest_event_file, vals,
811                                     ARRAY_SIZE(vals));
812
813To remove the synthetic event, the event should be disabled, and the
814trace instance should be 'put' back using trace_put_event_file()::
815
816       trace_array_set_clr_event(schedtest_event_file->tr,
817                                 "synthetic", "schedtest", false);
818       trace_put_event_file(schedtest_event_file);
819
820If those have been successful, synth_event_delete() can be called to
821remove the event::
822
823       ret = synth_event_delete("schedtest");
824
8257.2.2 Tracing a synthetic event piecewise
826-----------------------------------------
827
828To trace a synthetic using the piecewise method described above, the
829synth_event_trace_start() function is used to 'open' the synthetic
830event trace::
831
832       struct synth_event_trace_state trace_state;
833
834       ret = synth_event_trace_start(schedtest_event_file, &trace_state);
835
836It's passed the trace_event_file representing the synthetic event
837using the same methods as described above, along with a pointer to a
838struct synth_event_trace_state object, which will be zeroed before use and
839used to maintain state between this and following calls.
840
841Once the event has been opened, which means space for it has been
842reserved in the trace buffer, the individual fields can be set.  There
843are two ways to do that, either one after another for each field in
844the event, which requires no lookups, or by name, which does.  The
845tradeoff is flexibility in doing the assignments vs the cost of a
846lookup per field.
847
848To assign the values one after the other without lookups,
849synth_event_add_next_val() should be used.  Each call is passed the
850same synth_event_trace_state object used in the synth_event_trace_start(),
851along with the value to set the next field in the event.  After each
852field is set, the 'cursor' points to the next field, which will be set
853by the subsequent call, continuing until all the fields have been set
854in order.  The same sequence of calls as in the above examples using
855this method would be (without error-handling code)::
856
857       /* next_pid_field */
858       ret = synth_event_add_next_val(777, &trace_state);
859
860       /* next_comm_field */
861       ret = synth_event_add_next_val((u64)"slinky", &trace_state);
862
863       /* ts_ns */
864       ret = synth_event_add_next_val(1000000, &trace_state);
865
866       /* ts_ms */
867       ret = synth_event_add_next_val(1000, &trace_state);
868
869       /* cpu */
870       ret = synth_event_add_next_val(smp_processor_id(), &trace_state);
871
872       /* my_string_field */
873       ret = synth_event_add_next_val((u64)"thneed_2.01", &trace_state);
874
875       /* my_int_field */
876       ret = synth_event_add_next_val(395, &trace_state);
877
878To assign the values in any order, synth_event_add_val() should be
879used.  Each call is passed the same synth_event_trace_state object used in
880the synth_event_trace_start(), along with the field name of the field
881to set and the value to set it to.  The same sequence of calls as in
882the above examples using this method would be (without error-handling
883code)::
884
885       ret = synth_event_add_val("next_pid_field", 777, &trace_state);
886       ret = synth_event_add_val("next_comm_field", (u64)"silly putty",
887                                 &trace_state);
888       ret = synth_event_add_val("ts_ns", 1000000, &trace_state);
889       ret = synth_event_add_val("ts_ms", 1000, &trace_state);
890       ret = synth_event_add_val("cpu", smp_processor_id(), &trace_state);
891       ret = synth_event_add_val("my_string_field", (u64)"thneed_9",
892                                 &trace_state);
893       ret = synth_event_add_val("my_int_field", 3999, &trace_state);
894
895Note that synth_event_add_next_val() and synth_event_add_val() are
896incompatible if used within the same trace of an event - either one
897can be used but not both at the same time.
898
899Finally, the event won't be actually traced until it's 'closed',
900which is done using synth_event_trace_end(), which takes only the
901struct synth_event_trace_state object used in the previous calls::
902
903       ret = synth_event_trace_end(&trace_state);
904
905Note that synth_event_trace_end() must be called at the end regardless
906of whether any of the add calls failed (say due to a bad field name
907being passed in).
908
9097.3 Dyamically creating kprobe and kretprobe event definitions
910--------------------------------------------------------------
911
912To create a kprobe or kretprobe trace event from kernel code, the
913kprobe_event_gen_cmd_start() or kretprobe_event_gen_cmd_start()
914functions can be used.
915
916To create a kprobe event, an empty or partially empty kprobe event
917should first be created using kprobe_event_gen_cmd_start().  The name
918of the event and the probe location should be specfied along with one
919or args each representing a probe field should be supplied to this
920function.  Before calling kprobe_event_gen_cmd_start(), the user
921should create and initialize a dynevent_cmd object using
922kprobe_event_cmd_init().
923
924For example, to create a new "schedtest" kprobe event with two fields::
925
926  struct dynevent_cmd cmd;
927  char *buf;
928
929  /* Create a buffer to hold the generated command */
930  buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
931
932  /* Before generating the command, initialize the cmd object */
933  kprobe_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN);
934
935  /*
936   * Define the gen_kprobe_test event with the first 2 kprobe
937   * fields.
938   */
939  ret = kprobe_event_gen_cmd_start(&cmd, "gen_kprobe_test", "do_sys_open",
940                                   "dfd=%ax", "filename=%dx");
941
942Once the kprobe event object has been created, it can then be
943populated with more fields.  Fields can be added using
944kprobe_event_add_fields(), supplying the dynevent_cmd object along
945with a variable arg list of probe fields.  For example, to add a
946couple additional fields, the following call could be made::
947
948  ret = kprobe_event_add_fields(&cmd, "flags=%cx", "mode=+4($stack)");
949
950Once all the fields have been added, the event should be finalized and
951registered by calling the kprobe_event_gen_cmd_end() or
952kretprobe_event_gen_cmd_end() functions, depending on whether a kprobe
953or kretprobe command was started::
954
955  ret = kprobe_event_gen_cmd_end(&cmd);
956
957or::
958
959  ret = kretprobe_event_gen_cmd_end(&cmd);
960
961At this point, the event object is ready to be used for tracing new
962events.
963
964Similarly, a kretprobe event can be created using
965kretprobe_event_gen_cmd_start() with a probe name and location and
966additional params such as $retval::
967
968  ret = kretprobe_event_gen_cmd_start(&cmd, "gen_kretprobe_test",
969                                      "do_sys_open", "$retval");
970
971Similar to the synthetic event case, code like the following can be
972used to enable the newly created kprobe event::
973
974  gen_kprobe_test = trace_get_event_file(NULL, "kprobes", "gen_kprobe_test");
975
976  ret = trace_array_set_clr_event(gen_kprobe_test->tr,
977                                  "kprobes", "gen_kprobe_test", true);
978
979Finally, also similar to synthetic events, the following code can be
980used to give the kprobe event file back and delete the event::
981
982  trace_put_event_file(gen_kprobe_test);
983
984  ret = kprobe_event_delete("gen_kprobe_test");
985
9867.4 The "dynevent_cmd" low-level API
987------------------------------------
988
989Both the in-kernel synthetic event and kprobe interfaces are built on
990top of a lower-level "dynevent_cmd" interface.  This interface is
991meant to provide the basis for higher-level interfaces such as the
992synthetic and kprobe interfaces, which can be used as examples.
993
994The basic idea is simple and amounts to providing a general-purpose
995layer that can be used to generate trace event commands.  The
996generated command strings can then be passed to the command-parsing
997and event creation code that already exists in the trace event
998subystem for creating the corresponding trace events.
999
1000In a nutshell, the way it works is that the higher-level interface
1001code creates a struct dynevent_cmd object, then uses a couple
1002functions, dynevent_arg_add() and dynevent_arg_pair_add() to build up
1003a command string, which finally causes the command to be executed
1004using the dynevent_create() function.  The details of the interface
1005are described below.
1006
1007The first step in building a new command string is to create and
1008initialize an instance of a dynevent_cmd.  Here, for instance, we
1009create a dynevent_cmd on the stack and initialize it::
1010
1011  struct dynevent_cmd cmd;
1012  char *buf;
1013  int ret;
1014
1015  buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
1016
1017  dynevent_cmd_init(cmd, buf, maxlen, DYNEVENT_TYPE_FOO,
1018                    foo_event_run_command);
1019
1020The dynevent_cmd initialization needs to be given a user-specified
1021buffer and the length of the buffer (MAX_DYNEVENT_CMD_LEN can be used
1022for this purpose - at 2k it's generally too big to be comfortably put
1023on the stack, so is dynamically allocated), a dynevent type id, which
1024is meant to be used to check that further API calls are for the
1025correct command type, and a pointer to an event-specific run_command()
1026callback that will be called to actually execute the event-specific
1027command function.
1028
1029Once that's done, the command string can by built up by successive
1030calls to argument-adding functions.
1031
1032To add a single argument, define and initialize a struct dynevent_arg
1033or struct dynevent_arg_pair object.  Here's an example of the simplest
1034possible arg addition, which is simply to append the given string as
1035a whitespace-separated argument to the command::
1036
1037  struct dynevent_arg arg;
1038
1039  dynevent_arg_init(&arg, NULL, 0);
1040
1041  arg.str = name;
1042
1043  ret = dynevent_arg_add(cmd, &arg);
1044
1045The arg object is first initialized using dynevent_arg_init() and in
1046this case the parameters are NULL or 0, which means there's no
1047optional sanity-checking function or separator appended to the end of
1048the arg.
1049
1050Here's another more complicated example using an 'arg pair', which is
1051used to create an argument that consists of a couple components added
1052together as a unit, for example, a 'type field_name;' arg or a simple
1053expression arg e.g. 'flags=%cx'::
1054
1055  struct dynevent_arg_pair arg_pair;
1056
1057  dynevent_arg_pair_init(&arg_pair, dynevent_foo_check_arg_fn, 0, ';');
1058
1059  arg_pair.lhs = type;
1060  arg_pair.rhs = name;
1061
1062  ret = dynevent_arg_pair_add(cmd, &arg_pair);
1063
1064Again, the arg_pair is first initialized, in this case with a callback
1065function used to check the sanity of the args (for example, that
1066neither part of the pair is NULL), along with a character to be used
1067to add an operator between the pair (here none) and a separator to be
1068appended onto the end of the arg pair (here ';').
1069
1070There's also a dynevent_str_add() function that can be used to simply
1071add a string as-is, with no spaces, delimeters, or arg check.
1072
1073Any number of dynevent_*_add() calls can be made to build up the string
1074(until its length surpasses cmd->maxlen).  When all the arguments have
1075been added and the command string is complete, the only thing left to
1076do is run the command, which happens by simply calling
1077dynevent_create()::
1078
1079  ret = dynevent_create(&cmd);
1080
1081At that point, if the return value is 0, the dynamic event has been
1082created and is ready to use.
1083
1084See the dynevent_cmd function definitions themselves for the details
1085of the API.
1086