=============
Event Tracing
=============

:Author: Theodore Ts'o
:Updated: Li Zefan and Tom Zanussi

1. Introduction
===============

Using the event tracing infrastructure, tracepoints (see
Documentation/trace/tracepoints.rst) can be used without creating
custom kernel modules to register probe functions.

Not all tracepoints can be traced using the event tracing system;
the kernel developer must provide code snippets which define how the
tracing information is saved into the tracing buffer, and how the
tracing information should be printed.

2. Using Event Tracing
======================

2.1 Via the 'set_event' interface
---------------------------------

The events which are available for tracing can be found in the file
/sys/kernel/debug/tracing/available_events.

To enable a particular event, such as 'sched_wakeup', simply echo it
to /sys/kernel/debug/tracing/set_event. For example::

    # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event

.. Note:: '>>' is necessary, otherwise it will first disable all the events.

To disable an event, echo the event name to the set_event file prefixed
with an exclamation point::

    # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event

To disable all events, echo an empty line to the set_event file::

    # echo > /sys/kernel/debug/tracing/set_event

To enable all events, echo ``*:*`` or ``*:`` to the set_event file::

    # echo '*:*' > /sys/kernel/debug/tracing/set_event

The events are organized into subsystems, such as ext4, irq, sched,
etc., and a full event name looks like this: <subsystem>:<event>.  The
subsystem name is optional, but it is displayed in the available_events
file.
All of the events in a subsystem can be specified via the syntax
``<subsystem>:*``; for example, to enable all irq events, you can use the
command::

    # echo 'irq:*' > /sys/kernel/debug/tracing/set_event

2.2 Via the 'enable' toggle
---------------------------

The available events are also listed in the
/sys/kernel/debug/tracing/events/ hierarchy of directories.

To enable event 'sched_wakeup'::

    # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable

To disable it::

    # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable

To enable all events in the sched subsystem::

    # echo 1 > /sys/kernel/debug/tracing/events/sched/enable

To enable all events::

    # echo 1 > /sys/kernel/debug/tracing/events/enable

When reading one of these enable files, there are four possible results:

 - 0 - all events this file affects are disabled
 - 1 - all events this file affects are enabled
 - X - there is a mixture of events enabled and disabled
 - ? - this file does not affect any event

2.3 Boot option
---------------

In order to facilitate early boot debugging, use the boot option::

    trace_event=[event-list]

event-list is a comma-separated list of events. See section 2.1 for the
event format.

3. Defining an event-enabled tracepoint
=======================================

See the example provided in samples/trace_events.

4. Event formats
================

Each trace event has a 'format' file associated with it that contains
a description of each field in a logged event.  This information can
be used to parse the binary trace stream, and is also the place to
find the field names that can be used in event filters (see section 5).

It also displays the format string that will be used to print the
event in text mode, along with the event name and ID used for
profiling.
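
As a rough illustration of using a 'format' file to find the field names
that are usable in filters, the shell sketch below prints the name of
every field a format file describes. The helper name
``list_event_fields`` is hypothetical (it is not part of the kernel
tree), and the sketch assumes each field line has the
``field:<type> <name>; offset:N; size:N;`` shape described later in this
section.

```shell
# Hypothetical helper (not part of the kernel tree): print the name of
# every field described in an event 'format' file, one per line.
# Assumes field lines look like "field:<type> <name>; offset:N; size:N;".
list_event_fields() {
    awk '/^[[:space:]]*field:/ {
        sub(/^[[:space:]]*field:/, "")   # drop the "field:" prefix
        split($0, parts, ";")            # parts[1] is "<type> <name>"
        n = split(parts[1], words, " ")  # the name is the last word
        name = words[n]
        sub(/\[.*/, "", name)            # strip an array suffix such as [16]
        print name
    }' "$1"
}

# Example (needs root and a mounted debugfs):
#   list_event_fields /sys/kernel/debug/tracing/events/sched/sched_wakeup/format
```

Taking the file path as an argument keeps the helper usable on a saved
copy of a format file as well as on the live tracing directory.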

Every event has a set of ``common`` fields associated with it; these are
the fields prefixed with ``common_``.  The other fields vary between
events and correspond to the fields defined in the TRACE_EVENT
definition for that event.

Each field in the format has the form::

    field:field-type field-name; offset:N; size:N;

where offset is the offset of the field in the trace record and size
is the size of the data item, in bytes.

For example, here's the information displayed for the 'sched_wakeup'
event::

    # cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format

    name: sched_wakeup
    ID: 60
    format:
        field:unsigned short common_type;       offset:0;       size:2;
        field:unsigned char common_flags;       offset:2;       size:1;
        field:unsigned char common_preempt_count;       offset:3;       size:1;
        field:int common_pid;   offset:4;       size:4;
        field:int common_tgid;  offset:8;       size:4;

        field:char comm[TASK_COMM_LEN]; offset:12;      size:16;
        field:pid_t pid;        offset:28;      size:4;
        field:int prio; offset:32;      size:4;
        field:int success;      offset:36;      size:4;
        field:int cpu;  offset:40;      size:4;

    print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid,
               REC->prio, REC->success, REC->cpu

This event contains 10 fields, the first 5 common and the remaining 5
event-specific.  All the fields for this event are numeric, except for
'comm' which is a string, a distinction important for event filtering.

5. Event filtering
==================

Trace events can be filtered in the kernel by associating boolean
'filter expressions' with them.  As soon as an event is logged into
the trace buffer, its fields are checked against the filter expression
associated with that event type.  An event with field values that
'match' the filter will appear in the trace output, and an event whose
values don't match will be discarded.
An event with no filter
associated with it matches everything, and is the default when no
filter has been set for an event.

5.1 Expression syntax
---------------------

A filter expression consists of one or more 'predicates' that can be
combined using the logical operators '&&' and '||'.  A predicate is
simply a clause that compares the value of a field contained within a
logged event with a constant value and returns either 0 or 1 depending
on whether the field value matched (1) or didn't match (0)::

    field-name relational-operator value

Parentheses can be used to provide arbitrary logical groupings and
double-quotes can be used to prevent the shell from interpreting
operators as shell metacharacters.

The field-names available for use in filters can be found in the
'format' files for trace events (see section 4).

The relational-operators depend on the type of the field being tested:

The operators available for numeric fields are:

==, !=, <, <=, >, >=, &

And for string fields they are:

==, !=, ~

The glob (~) accepts wildcard characters (\*, ?) and character classes
([). For example::

    prev_comm ~ "*sh"
    prev_comm ~ "sh*"
    prev_comm ~ "*sh*"
    prev_comm ~ "ba*sh"

5.2 Setting filters
-------------------

A filter for an individual event is set by writing a filter expression
to the 'filter' file for the given event.

For example::

    # cd /sys/kernel/debug/tracing/events/sched/sched_wakeup
    # echo "common_preempt_count > 4" > filter

A slightly more involved example::

    # cd /sys/kernel/debug/tracing/events/signal/signal_generate
    # echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter

If there is an error in the expression, you'll get an 'Invalid
argument' error when setting it, and the erroneous string along with
an error message can be seen by looking at the filter, e.g.::

    # cd /sys/kernel/debug/tracing/events/signal/signal_generate
    # echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter
    -bash: echo: write error: Invalid argument
    # cat filter
    ((sig >= 10 && sig < 15) || dsig == 17) && comm != bash
    ^
    parse_error: Field not found

Currently the caret ('^') for an error always appears at the beginning of
the filter string; the error message should still be useful though
even without more accurate position info.

5.3 Clearing filters
--------------------

To clear the filter for an event, write a '0' to the event's filter
file.

To clear the filters for all events in a subsystem, write a '0' to the
subsystem's filter file.

5.4 Subsystem filters
---------------------

For convenience, filters for every event in a subsystem can be set or
cleared as a group by writing a filter expression into the filter file
at the root of the subsystem.  Note however, that if any event within
the subsystem lacks a field specified in the subsystem filter, or if
the filter can't be applied for any other reason, the filter for that
event will retain its previous setting.  This can result in an
unintended mixture of filters which could lead to confusing (to the
user who might think different filters are in effect) trace output.
Only filters that reference just the common fields can be guaranteed
to propagate successfully to all events.

Here are a few subsystem filter examples that also illustrate the
above points:

Clear the filters on all events in the sched subsystem::

    # cd /sys/kernel/debug/tracing/events/sched
    # echo 0 > filter
    # cat sched_switch/filter
    none
    # cat sched_wakeup/filter
    none

Set a filter using only common fields for all events in the sched
subsystem (all events end up with the same filter)::

    # cd /sys/kernel/debug/tracing/events/sched
    # echo common_pid == 0 > filter
    # cat sched_switch/filter
    common_pid == 0
    # cat sched_wakeup/filter
    common_pid == 0

Attempt to set a filter using a non-common field for all events in the
sched subsystem (all events but those that have a prev_pid field retain
their old filters)::

    # cd /sys/kernel/debug/tracing/events/sched
    # echo prev_pid == 0 > filter
    # cat sched_switch/filter
    prev_pid == 0
    # cat sched_wakeup/filter
    common_pid == 0

5.5 PID filtering
-----------------

The set_event_pid file, located in the same directory as the top-level
events directory, restricts event tracing to tasks whose PIDs are
listed in it; events from any other task are filtered out.
::

    # cd /sys/kernel/debug/tracing
    # echo $$ > set_event_pid
    # echo 1 > events/enable

This will only trace events for the current task.

To add more PIDs without losing the PIDs already included, use '>>'.
::

    # echo 123 244 1 >> set_event_pid


6. Event triggers
=================

Trace events can be made to conditionally invoke trigger 'commands'
which can take various forms and are described in detail below;
examples would be enabling or disabling other trace events or invoking
a stack trace whenever the trace event is hit.
Whenever a trace event
with attached triggers is invoked, the set of trigger commands
associated with that event is invoked.  Any given trigger can
additionally have an event filter of the same form as described in
section 5 (Event filtering) associated with it - the command will only
be invoked if the event being invoked passes the associated filter.
If no filter is associated with the trigger, it always passes.

Triggers are added to and removed from a particular event by writing
trigger expressions to the 'trigger' file for the given event.

A given event can have any number of triggers associated with it,
subject to any restrictions that individual commands may have in that
regard.

Event triggers are implemented on top of "soft" mode, which means that
whenever a trace event has one or more triggers associated with it,
the event is activated even if it isn't actually enabled, but is
disabled in a "soft" mode.  That is, the tracepoint will be called,
but just will not be traced, unless of course it's actually enabled.
This scheme allows triggers to be invoked even for events that aren't
enabled, and also allows the current event filter implementation to be
used for conditionally invoking triggers.

The syntax for event triggers is roughly based on the syntax for
set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands'
section of Documentation/trace/ftrace.rst), but there are major
differences and the implementation isn't currently tied to it in any
way, so beware about making generalizations between the two.
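
Writing trigger expressions to an event's 'trigger' file can be wrapped
in small shell helpers, sketched below. The names ``add_trigger`` and
``remove_trigger`` and the ``TRACING`` variable are hypothetical and
chosen only for illustration; removal prefixes the command with '!', as
covered in section 6.1. Root privileges and a mounted tracing directory
are assumed.

```shell
# Hypothetical helpers, not part of any kernel tooling.
# TRACING is assumed to point at the tracing directory.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# add_trigger <system> <event> <trigger expression>
add_trigger() {
    printf '%s\n' "$3" > "$TRACING/events/$1/$2/trigger"
}

# remove_trigger <system> <event> <trigger command>
# The command is written back prefixed with '!'.
remove_trigger() {
    printf '!%s\n' "$3" > "$TRACING/events/$1/$2/trigger"
}

# Example (run as root):
#   add_trigger kmem kmalloc 'stacktrace:5 if bytes_req >= 65536'
#   remove_trigger kmem kmalloc 'stacktrace:5'
```

The '!' is kept inside a single-quoted printf format so that an
interactive shell with history expansion enabled does not try to
interpret it.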

.. Note::
     Writing into trace_marker (See Documentation/trace/ftrace.rst)
     can also enable triggers that are written into
     /sys/kernel/tracing/events/ftrace/print/trigger

6.1 Expression syntax
---------------------

Triggers are added by echoing the command to the 'trigger' file::

    # echo 'command[:count] [if filter]' > trigger

Triggers are removed by echoing the same command but starting with '!'
to the 'trigger' file::

    # echo '!command[:count] [if filter]' > trigger

The [if filter] part isn't used in matching commands when removing, so
leaving that off in a '!' command will accomplish the same thing as
having it in.

The filter syntax is the same as that described in the 'Event
filtering' section above.

For ease of use, writing to the trigger file using '>' currently just
adds or removes a single trigger and there's no explicit '>>' support
('>' actually behaves like '>>') or truncation support to remove all
triggers (you have to use '!' for each one added.)

6.2 Supported trigger commands
------------------------------

The following commands are supported:

- enable_event/disable_event

  These commands can enable or disable another trace event whenever
  the triggering event is hit.  When these commands are registered,
  the other trace event is activated, but disabled in a "soft" mode.
  That is, the tracepoint will be called, but just will not be traced.
  The event tracepoint stays in this mode as long as there's a trigger
  in effect that can trigger it.

  For example, the following trigger causes kmalloc events to be
  traced when a read system call is entered, and the :1 at the end
  specifies that this enablement happens only once::

      # echo 'enable_event:kmem:kmalloc:1' > \
          /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger

  The following trigger causes kmalloc events to stop being traced
  when a read system call exits.  This disablement happens on every
  read system call exit::

      # echo 'disable_event:kmem:kmalloc' > \
          /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger

  The format is::

      enable_event:<system>:<event>[:count]
      disable_event:<system>:<event>[:count]

  To remove the above commands::

      # echo '!enable_event:kmem:kmalloc:1' > \
          /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger

      # echo '!disable_event:kmem:kmalloc' > \
          /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger

  Note that there can be any number of enable/disable_event triggers
  per triggering event, but there can only be one trigger per
  triggered event. e.g. sys_enter_read can have triggers enabling both
  kmem:kmalloc and sched:sched_switch, but can't have two kmem:kmalloc
  versions such as kmem:kmalloc and kmem:kmalloc:1 or 'kmem:kmalloc if
  bytes_req == 256' and 'kmem:kmalloc if bytes_alloc == 256' (they
  could be combined into a single filter on kmem:kmalloc though).

- stacktrace

  This command dumps a stacktrace in the trace buffer whenever the
  triggering event occurs.

  For example, the following trigger dumps a stacktrace every time the
  kmalloc tracepoint is hit::

      # echo 'stacktrace' > \
          /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger

  The following trigger dumps a stacktrace the first 5 times a kmalloc
  request happens with a size >= 64K::

      # echo 'stacktrace:5 if bytes_req >= 65536' > \
          /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger

  The format is::

      stacktrace[:count]

  To remove the above commands::

      # echo '!stacktrace' > \
          /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger

      # echo '!stacktrace:5 if bytes_req >= 65536' > \
          /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger

  The latter can also be removed more simply by the following (without
  the filter)::

      # echo '!stacktrace:5' > \
          /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger

  Note that there can be only one stacktrace trigger per triggering
  event.

- snapshot

  This command causes a snapshot to be triggered whenever the
  triggering event occurs.

  The following command creates a snapshot every time a block request
  queue is unplugged with a depth > 1.  If you were tracing a set of
  events or functions at the time, the snapshot trace buffer would
  capture those events when the trigger event occurred::

      # echo 'snapshot if nr_rq > 1' > \
          /sys/kernel/debug/tracing/events/block/block_unplug/trigger

  To only snapshot once::

      # echo 'snapshot:1 if nr_rq > 1' > \
          /sys/kernel/debug/tracing/events/block/block_unplug/trigger

  To remove the above commands::

      # echo '!snapshot if nr_rq > 1' > \
          /sys/kernel/debug/tracing/events/block/block_unplug/trigger

      # echo '!snapshot:1 if nr_rq > 1' > \
          /sys/kernel/debug/tracing/events/block/block_unplug/trigger

  Note that there can be only one snapshot trigger per triggering
  event.

- traceon/traceoff

  These commands turn tracing on and off when the specified events are
  hit. The parameter determines how many times the tracing system is
  turned on and off. If unspecified, there is no limit.

  The following command turns tracing off the first time a block
  request queue is unplugged with a depth > 1.  If you were tracing a
  set of events or functions at the time, you could then examine the
  trace buffer to see the sequence of events that led up to the
  trigger event::

      # echo 'traceoff:1 if nr_rq > 1' > \
          /sys/kernel/debug/tracing/events/block/block_unplug/trigger

  To always disable tracing when nr_rq > 1::

      # echo 'traceoff if nr_rq > 1' > \
          /sys/kernel/debug/tracing/events/block/block_unplug/trigger

  To remove the above commands::

      # echo '!traceoff:1 if nr_rq > 1' > \
          /sys/kernel/debug/tracing/events/block/block_unplug/trigger

      # echo '!traceoff if nr_rq > 1' > \
          /sys/kernel/debug/tracing/events/block/block_unplug/trigger

  Note that there can be only one traceon or traceoff trigger per
  triggering event.

- hist

  This command aggregates event hits into a hash table keyed on one or
  more trace event format fields (or stacktrace) and a set of running
  totals derived from one or more trace event format fields and/or
  event counts (hitcount).

  See Documentation/trace/histogram.rst for details and examples.

6.3 In-kernel trace event API
-----------------------------

In most cases, the command-line interface to trace events is more than
sufficient.  Sometimes, however, applications might find the need for
more complex relationships than can be expressed through a simple
series of linked command-line expressions, or putting together sets of
commands may be simply too cumbersome.
An example might be an
application that needs to 'listen' to the trace stream in order to
maintain an in-kernel state machine detecting, for instance, when an
illegal kernel state occurs in the scheduler.

The trace event subsystem provides an in-kernel API allowing modules
or other kernel code to generate user-defined 'synthetic' events at
will, which can be used to either augment the existing trace stream
and/or signal that a particular important state has occurred.

A similar in-kernel API is also available for creating kprobe and
kretprobe events.

Both the synthetic event and k/ret/probe event APIs are built on top
of a lower-level "dynevent_cmd" event command API, which is also
available for more specialized applications, or as the basis of other
higher-level trace event APIs.

The API provided for these purposes is described below and allows the
following:

  - dynamically creating synthetic event definitions
  - dynamically creating kprobe and kretprobe event definitions
  - tracing synthetic events from in-kernel code
  - the low-level "dynevent_cmd" API

6.3.1 Dynamically creating synthetic event definitions
------------------------------------------------------

There are a couple of ways to create a new synthetic event from a kernel
module or other kernel code.

The first creates the event in one step, using synth_event_create().
In this method, the name of the event to create and an array defining
the fields is supplied to synth_event_create().  If successful, a
synthetic event with that name and fields will exist following that
call.
For example, to create a new "schedtest" synthetic event::

    ret = synth_event_create("schedtest", sched_fields,
                             ARRAY_SIZE(sched_fields), THIS_MODULE);

The sched_fields param in this example points to an array of struct
synth_field_desc, each of which describes an event field by type and
name::

    static struct synth_field_desc sched_fields[] = {
          { .type = "pid_t",              .name = "next_pid_field" },
          { .type = "char[16]",           .name = "next_comm_field" },
          { .type = "u64",                .name = "ts_ns" },
          { .type = "u64",                .name = "ts_ms" },
          { .type = "unsigned int",       .name = "cpu" },
          { .type = "char[64]",           .name = "my_string_field" },
          { .type = "int",                .name = "my_int_field" },
    };

See synth_field_size() for available types. If field_name contains [n]
the field is considered to be an array.

If the event is created from within a module, a pointer to the module
must be passed to synth_event_create().  This will ensure that the
trace buffer won't contain unreadable events when the module is
removed.

At this point, the event object is ready to be used for generating new
events.

In the second method, the event is created in several steps.  This
allows events to be created dynamically and without the need to create
and populate an array of fields beforehand.

To use this method, an empty or partially empty synthetic event should
first be created using synth_event_gen_cmd_start() or
synth_event_gen_cmd_array_start().  For synth_event_gen_cmd_start(),
the name of the event along with one or more pairs of args each pair
representing a 'type field_name;' field specification should be
supplied.  For synth_event_gen_cmd_array_start(), the name of the
event along with an array of struct synth_field_desc should be
supplied.
Before calling synth_event_gen_cmd_start() or
synth_event_gen_cmd_array_start(), the user should create and
initialize a dynevent_cmd object using synth_event_cmd_init().

For example, to create a new "schedtest" synthetic event with two
fields::

    struct dynevent_cmd cmd;
    char *buf;

    /* Create a buffer to hold the generated command */
    buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);

    /* Before generating the command, initialize the cmd object */
    synth_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN);

    ret = synth_event_gen_cmd_start(&cmd, "schedtest", THIS_MODULE,
                                    "pid_t", "next_pid_field",
                                    "u64", "ts_ns");

Alternatively, using an array of struct synth_field_desc fields
containing the same information::

    ret = synth_event_gen_cmd_array_start(&cmd, "schedtest", THIS_MODULE,
                                          fields, n_fields);

Once the synthetic event object has been created, it can then be
populated with more fields.  Fields are added one by one using
synth_event_add_field(), supplying the dynevent_cmd object, a field
type, and a field name.  For example, to add a new int field named
"intfield", the following call should be made::

    ret = synth_event_add_field(&cmd, "int", "intfield");

See synth_field_size() for available types. If field_name contains [n]
the field is considered to be an array.

A group of fields can also be added all at once using an array of
synth_field_desc with synth_event_add_fields().  For example, this would
add just the first four sched_fields::

    ret = synth_event_add_fields(&cmd, sched_fields, 4);

If you already have a string of the form 'type field_name',
synth_event_add_field_str() can be used to add it as-is; it will
also automatically append a ';' to the string.

Once all the fields have been added, the event should be finalized and
registered by calling the synth_event_gen_cmd_end() function::

    ret = synth_event_gen_cmd_end(&cmd);

At this point, the event object is ready to be used for tracing new
events.

6.3.3 Tracing synthetic events from in-kernel code
--------------------------------------------------

To trace a synthetic event, there are several options.  The first
option is to trace the event in one call, using synth_event_trace()
with a variable number of values, or synth_event_trace_array() with an
array of values to be set.  A second option can be used to avoid the
need for a pre-formed array of values or list of arguments, via
synth_event_trace_start() and synth_event_trace_end() along with
synth_event_add_next_val() or synth_event_add_val() to add the values
piecewise.

6.3.3.1 Tracing a synthetic event all at once
---------------------------------------------

To trace a synthetic event all at once, the synth_event_trace() or
synth_event_trace_array() functions can be used.

The synth_event_trace() function is passed the trace_event_file
representing the synthetic event (which can be retrieved using
trace_get_event_file() using the synthetic event name, "synthetic" as
the system name, and the trace instance name (NULL if using the global
trace array)), along with a variable number of u64 args, one for each
synthetic event field, and the number of values being passed.

So, to trace an event corresponding to the synthetic event definition
above, code like the following could be used::

    ret = synth_event_trace(create_synth_test, 7, /* number of values */
                            444,                  /* next_pid_field */
                            (u64)"clackers",      /* next_comm_field */
                            1000000,              /* ts_ns */
                            1000,                 /* ts_ms */
                            smp_processor_id(),   /* cpu */
                            (u64)"Thneed",        /* my_string_field */
                            999);                 /* my_int_field */

All vals should be cast to u64, and string vals are just pointers to
strings, cast to u64.  Strings will be copied into space reserved in
the event for the string, using these pointers.

Alternatively, the synth_event_trace_array() function can be used to
accomplish the same thing.  It is passed the trace_event_file
representing the synthetic event (which can be retrieved using
trace_get_event_file() using the synthetic event name, "synthetic" as
the system name, and the trace instance name (NULL if using the global
trace array)), along with an array of u64, one for each synthetic
event field.

To trace an event corresponding to the synthetic event definition
above, code like the following could be used::

    u64 vals[7];

    vals[0] = 777;                  /* next_pid_field */
    vals[1] = (u64)"tiddlywinks";   /* next_comm_field */
    vals[2] = 1000000;              /* ts_ns */
    vals[3] = 1000;                 /* ts_ms */
    vals[4] = smp_processor_id();   /* cpu */
    vals[5] = (u64)"thneed";        /* my_string_field */
    vals[6] = 398;                  /* my_int_field */

The 'vals' array is just an array of u64, the number of which must
match the number of fields in the synthetic event, and which must be in
the same order as the synthetic event fields.

All vals should be cast to u64, and string vals are just pointers to
strings, cast to u64.  Strings will be copied into space reserved in
the event for the string, using these pointers.

In order to trace a synthetic event, a pointer to the trace event file
is needed.
The trace_get_event_file() function can be used to get
it - it will find the file in the given trace instance (in this case
NULL since the top trace array is being used) while at the same time
preventing the instance containing it from going away::

    schedtest_event_file = trace_get_event_file(NULL, "synthetic",
                                                "schedtest");

Before tracing the event, it should be enabled in some way, otherwise
the synthetic event won't actually show up in the trace buffer.

To enable a synthetic event from the kernel, trace_array_set_clr_event()
can be used (which is not specific to synthetic events, so does need
the "synthetic" system name to be specified explicitly).

To enable the event, pass 'true' to it::

    trace_array_set_clr_event(schedtest_event_file->tr,
                              "synthetic", "schedtest", true);

To disable it, pass 'false'::

    trace_array_set_clr_event(schedtest_event_file->tr,
                              "synthetic", "schedtest", false);

Finally, synth_event_trace_array() can be used to actually trace the
event, which should be visible in the trace buffer afterwards::

    ret = synth_event_trace_array(schedtest_event_file, vals,
                                  ARRAY_SIZE(vals));

To remove the synthetic event, the event should be disabled, and the
trace instance should be 'put' back using trace_put_event_file()::

    trace_array_set_clr_event(schedtest_event_file->tr,
                              "synthetic", "schedtest", false);
    trace_put_event_file(schedtest_event_file);

If those have been successful, synth_event_delete() can be called to
remove the event::

    ret = synth_event_delete("schedtest");

6.3.3.2 Tracing a synthetic event piecewise
-------------------------------------------

To trace a synthetic event using the piecewise method described above,
the synth_event_trace_start() function is used to 'open' the synthetic
event trace::

    struct synth_trace_state trace_state;

    ret = synth_event_trace_start(schedtest_event_file, &trace_state);

It's passed the trace_event_file representing the synthetic event
using the same methods as described above, along with a pointer to a
struct synth_trace_state object, which will be zeroed before use and
used to maintain state between this and following calls.

Once the event has been opened, which means space for it has been
reserved in the trace buffer, the individual fields can be set.  There
are two ways to do that, either one after another for each field in
the event, which requires no lookups, or by name, which does.  The
tradeoff is flexibility in doing the assignments vs the cost of a
lookup per field.

To assign the values one after the other without lookups,
synth_event_add_next_val() should be used.  Each call is passed the
same synth_trace_state object used in the synth_event_trace_start(),
along with the value to set the next field in the event.  After each
field is set, the 'cursor' points to the next field, which will be set
by the subsequent call, continuing until all the fields have been set
in order.  The same sequence of calls as in the above examples using
this method would be (without error-handling code)::

    /* next_pid_field */
    ret = synth_event_add_next_val(777, &trace_state);

    /* next_comm_field */
    ret = synth_event_add_next_val((u64)"slinky", &trace_state);

    /* ts_ns */
    ret = synth_event_add_next_val(1000000, &trace_state);

    /* ts_ms */
    ret = synth_event_add_next_val(1000, &trace_state);

    /* cpu */
    ret = synth_event_add_next_val(smp_processor_id(), &trace_state);

    /* my_string_field */
    ret = synth_event_add_next_val((u64)"thneed_2.01", &trace_state);

    /* my_int_field */
    ret = synth_event_add_next_val(395, &trace_state);

To assign the values in any order, synth_event_add_val() should be
used.
Each call is passed the same synth_trace_state object used in
the synth_event_trace_start() call, along with the name of the field
to set and the value to set it to.  The same sequence of calls as in
the above examples using this method would be (without error-handling
code)::

  ret = synth_event_add_val("next_pid_field", 777, &trace_state);
  ret = synth_event_add_val("next_comm_field", (u64)"silly putty",
                            &trace_state);
  ret = synth_event_add_val("ts_ns", 1000000, &trace_state);
  ret = synth_event_add_val("ts_ms", 1000, &trace_state);
  ret = synth_event_add_val("cpu", smp_processor_id(), &trace_state);
  ret = synth_event_add_val("my_string_field", (u64)"thneed_9",
                            &trace_state);
  ret = synth_event_add_val("my_int_field", 3999, &trace_state);

Note that synth_event_add_next_val() and synth_event_add_val() are
incompatible if used within the same trace of an event - either one
can be used but not both at the same time.

Finally, the event won't actually be traced until it's 'closed',
which is done using synth_event_trace_end(), which takes only the
struct synth_trace_state object used in the previous calls::

  ret = synth_event_trace_end(&trace_state);

Note that synth_event_trace_end() must be called at the end regardless
of whether any of the add calls failed (say due to a bad field name
being passed in).

6.3.4 Dynamically creating kprobe and kretprobe event definitions
-----------------------------------------------------------------

To create a kprobe or kretprobe trace event from kernel code, the
kprobe_event_gen_cmd_start() or kretprobe_event_gen_cmd_start()
functions can be used.

To create a kprobe event, an empty or partially empty kprobe event
should first be created using kprobe_event_gen_cmd_start().
The name of the event and the probe location should be specified,
along with one or more args, each representing a probe field.  Before
calling kprobe_event_gen_cmd_start(), the user should create and
initialize a dynevent_cmd object using kprobe_event_cmd_init().

For example, to create a new "gen_kprobe_test" kprobe event with two
fields::

  struct dynevent_cmd cmd;
  char *buf;
  int ret;

  /* Create a buffer to hold the generated command */
  buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
  if (!buf)
          return -ENOMEM;

  /* Before generating the command, initialize the cmd object */
  kprobe_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN);

  /*
   * Define the gen_kprobe_test event with the first 2 kprobe
   * fields.
   */
  ret = kprobe_event_gen_cmd_start(&cmd, "gen_kprobe_test", "do_sys_open",
                                   "dfd=%ax", "filename=%dx");

Once the kprobe event object has been created, it can then be
populated with more fields.  Fields can be added using
kprobe_event_add_fields(), supplying the dynevent_cmd object along
with a variable arg list of probe fields.  For example, to add a
couple of additional fields, the following call could be made::

  ret = kprobe_event_add_fields(&cmd, "flags=%cx", "mode=+4($stack)");

Once all the fields have been added, the event should be finalized and
registered by calling the kprobe_event_gen_cmd_end() or
kretprobe_event_gen_cmd_end() functions, depending on whether a kprobe
or kretprobe command was started::

  ret = kprobe_event_gen_cmd_end(&cmd);

or::

  ret = kretprobe_event_gen_cmd_end(&cmd);

At this point, the event object is ready to be used for tracing new
events.

Similarly, a kretprobe event can be created using
kretprobe_event_gen_cmd_start() with a probe name and location and
additional params such as $retval::

  ret = kretprobe_event_gen_cmd_start(&cmd, "gen_kretprobe_test",
                                      "do_sys_open", "$retval");

Similar to the synthetic event case, code like the following can be
used to enable the newly created kprobe event::

  gen_kprobe_test = trace_get_event_file(NULL, "kprobes", "gen_kprobe_test");

  ret = trace_array_set_clr_event(gen_kprobe_test->tr,
                                  "kprobes", "gen_kprobe_test", true);

Finally, also similar to synthetic events, the following code can be
used to give the kprobe event file back and delete the event::

  trace_put_event_file(gen_kprobe_test);

  ret = kprobe_event_delete("gen_kprobe_test");

6.3.5 The "dynevent_cmd" low-level API
--------------------------------------

Both the in-kernel synthetic event and kprobe interfaces are built on
top of a lower-level "dynevent_cmd" interface.  This interface is
meant to provide the basis for higher-level interfaces such as the
synthetic and kprobe interfaces, which can be used as examples.

The basic idea is simple and amounts to providing a general-purpose
layer that can be used to generate trace event commands.  The
generated command strings can then be passed to the command-parsing
and event creation code that already exists in the trace event
subsystem for creating the corresponding trace events.

In a nutshell, the way it works is that the higher-level interface
code creates a struct dynevent_cmd object, then uses a couple of
functions, dynevent_arg_add() and dynevent_arg_pair_add(), to build up
a command string, and finally executes the command using the
dynevent_create() function.  The details of the interface are
described below.

The first step in building a new command string is to create and
initialize an instance of a dynevent_cmd.  Here, for instance, we
create a dynevent_cmd on the stack and initialize it::

  struct dynevent_cmd cmd;
  char *buf;
  int ret;

  buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);

  dynevent_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN, DYNEVENT_TYPE_FOO,
                    foo_event_run_command);

The dynevent_cmd initialization needs to be given a user-specified
buffer and the length of the buffer (MAX_DYNEVENT_CMD_LEN can be used
for this purpose - at 2k it's generally too big to be comfortably put
on the stack, so is dynamically allocated), a dynevent type id, which
is meant to be used to check that further API calls are for the
correct command type, and a pointer to an event-specific run_command()
callback that will be called to actually execute the event-specific
command function.

Once that's done, the command string can be built up by successive
calls to argument-adding functions.

To add a single argument, define and initialize a struct dynevent_arg
or struct dynevent_arg_pair object.  Here's an example of the simplest
possible arg addition, which is simply to append the given string as
a whitespace-separated argument to the command::

  struct dynevent_arg arg;

  dynevent_arg_init(&arg, NULL, 0);

  arg.str = name;

  ret = dynevent_arg_add(&cmd, &arg);

The arg object is first initialized using dynevent_arg_init() and in
this case the parameters are NULL and 0, which means there's no
optional sanity-checking function and no separator appended to the end
of the arg.

Here's another more complicated example using an 'arg pair', which is
used to create an argument that consists of a couple of components
added together as a unit, for example, a 'type field_name;' arg or a
simple expression arg e.g.
'flags=%cx'::

  struct dynevent_arg_pair arg_pair;

  dynevent_arg_pair_init(&arg_pair, dynevent_foo_check_arg_fn, 0, ';');

  arg_pair.lhs = type;
  arg_pair.rhs = name;

  ret = dynevent_arg_pair_add(&cmd, &arg_pair);

Again, the arg_pair is first initialized, in this case with a callback
function used to check the sanity of the args (for example, that
neither part of the pair is NULL), along with a character to be used
to add an operator between the pair (here none) and a separator to be
appended onto the end of the arg pair (here ';').

There's also a dynevent_str_add() function that can be used to simply
add a string as-is, with no spaces, delimiters, or arg check.

Any number of dynevent_*_add() calls can be made to build up the string
(until its length surpasses cmd->maxlen).  When all the arguments have
been added and the command string is complete, the only thing left to
do is run the command, which happens by simply calling
dynevent_create()::

  ret = dynevent_create(&cmd);

At that point, if the return value is 0, the dynamic event has been
created and is ready to use.

See the dynevent_cmd function definitions themselves for the details
of the API.
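
As a rough illustration of how the argument-adding steps described in
this section compose a command string, the following userspace sketch
models them with plain C string handling.  It is not the kernel
implementation: struct demo_cmd and the demo_*() helpers are
hypothetical stand-ins for struct dynevent_cmd, dynevent_str_add(),
dynevent_arg_add() and dynevent_arg_pair_add(), and the rule that a
pair with no operator is joined by a single space is an assumption of
this model.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_CMD_LEN 128	/* hypothetical; stands in for cmd->maxlen */

/* Hypothetical userspace stand-in for struct dynevent_cmd. */
struct demo_cmd {
	char buf[DEMO_CMD_LEN];
	size_t len;
};

static void demo_cmd_init(struct demo_cmd *cmd)
{
	cmd->buf[0] = '\0';
	cmd->len = 0;
}

/* Append a string as-is; returns 0 on success, -1 if it would not fit. */
static int demo_str_add(struct demo_cmd *cmd, const char *str)
{
	size_t n = strlen(str);

	if (cmd->len + n >= DEMO_CMD_LEN)
		return -1;
	memcpy(cmd->buf + cmd->len, str, n + 1);
	cmd->len += n;
	return 0;
}

/* Append a single character; a value of 0 means "nothing to add". */
static int demo_char_add(struct demo_cmd *cmd, char c)
{
	char s[2] = { c, '\0' };

	return c ? demo_str_add(cmd, s) : 0;
}

/* Append " str" followed by an optional separator. */
static int demo_arg_add(struct demo_cmd *cmd, const char *str, char sep)
{
	if (!str)
		return -1;
	if (demo_str_add(cmd, " ") || demo_str_add(cmd, str))
		return -1;
	return demo_char_add(cmd, sep);
}

/*
 * Append " lhs<op>rhs<sep>".  With no operator (op == 0) the two
 * halves are joined by a space, giving e.g. "type field_name;".
 */
static int demo_arg_pair_add(struct demo_cmd *cmd, const char *lhs,
			     const char *rhs, char op, char sep)
{
	/* Sanity check, in the spirit of the check callback above. */
	if (!lhs || !rhs)
		return -1;
	if (demo_str_add(cmd, " ") || demo_str_add(cmd, lhs))
		return -1;
	if (demo_char_add(cmd, op ? op : ' ') || demo_str_add(cmd, rhs))
		return -1;
	return demo_char_add(cmd, sep);
}
```

With these helpers, calling demo_str_add(&cmd, "foo_event"), then
demo_arg_pair_add(&cmd, "u64", "ts_ns", 0, ';') and
demo_arg_pair_add(&cmd, "flags", "%cx", '=', 0) leaves the string
"foo_event u64 ts_ns; flags=%cx" in cmd.buf.  In the kernel, the
finished string would then be handed to the run_command() callback by
dynevent_create() rather than inspected directly.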