=========================================
user_events: User-based Event Tracing
=========================================

:Author: Beau Belgrave

Overview
--------
User based trace events allow user processes to create events and trace data
that can be viewed via existing tools, such as ftrace and perf.
To enable this feature, build your kernel with CONFIG_USER_EVENTS=y.

Programs can view status of the events via
/sys/kernel/tracing/user_events_status and can both register and write
data out via /sys/kernel/tracing/user_events_data.

Typically programs will register a set of events that they wish to expose to
tools that can read trace_events (such as ftrace and perf). The registration
process tells the kernel which address and bit to reflect if any tool has
enabled the event and data should be written. The registration will give back
a write index which describes the data when a write() or writev() is called
on the /sys/kernel/tracing/user_events_data file.

The structures referenced in this document are contained within the
/include/uapi/linux/user_events.h file in the source tree.


**NOTE:** *Both user_events_status and user_events_data are under the tracefs
filesystem and may be mounted at different paths than above.*

Registering
-----------
Registering within a user process is done via ioctl() out to the
/sys/kernel/tracing/user_events_data file. The command to issue is
DIAG_IOCSREG.

This command takes a packed struct user_reg as an argument::

  struct user_reg {
        /* Input: Size of the user_reg structure being used */
        __u32 size;

        /* Input: Bit in enable address to use */
        __u8 enable_bit;

        /* Input: Enable size in bytes at address */
        __u8 enable_size;

        /* Input: Flags for future use, set to 0 */
        __u16 flags;

        /* Input: Address to update when enabled */
        __u64 enable_addr;

        /* Input: Pointer to string with event name, description and flags */
        __u64 name_args;

        /* Output: Index of the event to use when writing data */
        __u32 write_index;
  } __attribute__((__packed__));

The struct user_reg requires all the above inputs to be set appropriately.

+ size: This must be set to sizeof(struct user_reg).

+ enable_bit: The bit to reflect the event status at the address specified by
  enable_addr.

+ enable_size: The size of the value specified by enable_addr.
  This must be 4 (32-bit) or 8 (64-bit). 64-bit values are only allowed to be
  used on 64-bit kernels, however, 32-bit can be used on all kernels.

+ flags: The flags to use, if any. For the initial version this must be 0.
  Callers should first attempt to use flags and retry without flags to ensure
  support for lower versions of the kernel. If a flag is not supported -EINVAL
  is returned.

+ enable_addr: The address of the value to use to reflect event status. This
  must be naturally aligned and write accessible within the user program.

+ name_args: The name and arguments to describe the event, see command format
  for details.

Upon successful registration the following is set.

+ write_index: The index to use for this file descriptor that represents this
  event when writing out data. The index is unique to this instance of the file
  descriptor that was used for the registration. See writing data for details.

User based events show up under tracefs like any other event under the
subsystem named "user_events".
This means tools that wish to attach to the
events need to use /sys/kernel/tracing/events/user_events/[name]/enable
or perf record -e user_events:[name] when attaching/recording.

**NOTE:** The event subsystem name by default is "user_events". Callers should
not assume it will always be "user_events". Operators reserve the right in the
future to change the subsystem name per-process to accommodate event isolation.

Command Format
^^^^^^^^^^^^^^
The command string format is as follows::

  name[:FLAG1[,FLAG2...]] [Field1[;Field2...]]

Supported Flags
^^^^^^^^^^^^^^^
None yet

Field Format
^^^^^^^^^^^^
::

  type name [size]

Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc).
User programs are encouraged to use clearly sized types like u32.

**NOTE:** *Long is not supported since size can vary between user and kernel.*

The size is only valid for types that start with a struct prefix.
This allows user programs to describe custom structs out to tools, if required.


For example, a struct in C that looks like this::

  struct mytype {
        char data[20];
  };

Would be represented by the following field::

  struct mytype myname 20

Deleting
--------
Deleting an event from within a user process is done via ioctl() out to the
/sys/kernel/tracing/user_events_data file. The command to issue is
DIAG_IOCSDEL.

This command only requires a single string specifying the event to delete by
its name. Delete will only succeed if there are no references left to the
event (in both user and kernel space). Because of this, user programs should
request deletes through a different file than the one used for registration.

**NOTE:** By default events will auto-delete when there are no references left
to the event. Flags in the future may change this logic.

Unregistering
-------------
If an event no longer needs to be updated after it has been registered, it can
be disabled via ioctl() out to the /sys/kernel/tracing/user_events_data file.
The command to issue is DIAG_IOCSUNREG. This is different from deleting, where
deleting actually removes the event from the system. Unregistering simply tells
the kernel your process is no longer interested in updates to the event.


This command takes a packed struct user_unreg as an argument::

  struct user_unreg {
        /* Input: Size of the user_unreg structure being used */
        __u32 size;

        /* Input: Bit to unregister */
        __u8 disable_bit;

        /* Input: Reserved, set to 0 */
        __u8 __reserved;

        /* Input: Reserved, set to 0 */
        __u16 __reserved2;

        /* Input: Address to unregister */
        __u64 disable_addr;
  } __attribute__((__packed__));

The struct user_unreg requires all the above inputs to be set appropriately.

+ size: This must be set to sizeof(struct user_unreg).

+ disable_bit: This must be set to the bit to disable (same bit that was
  previously registered via enable_bit).

+ disable_addr: This must be set to the address to disable (same address that was
  previously registered via enable_addr).

**NOTE:** Events are automatically unregistered when execve() is invoked. During
fork() the registered events will be retained and must be unregistered manually
in each process if wanted.

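
As a rough sketch of how the user_unreg fields line up with an earlier
registration, the helper below fills the structure from the enable_addr and
enable_bit values that were passed when registering. The struct is re-declared
locally with <stdint.h> types only so the fragment builds on its own; a real
program would include <linux/user_events.h> instead. The name fill_user_unreg
is hypothetical and is not part of the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Local mirror of the struct shown above, declared here only so this sketch
 * is self-contained. Real programs should use <linux/user_events.h>.
 */
struct user_unreg {
    uint32_t size;
    uint8_t  disable_bit;
    uint8_t  __reserved;
    uint16_t __reserved2;
    uint64_t disable_addr;
} __attribute__((__packed__));

/*
 * Hypothetical helper: describe the (address, bit) pair that was registered
 * earlier, so the result can be handed to ioctl(fd, DIAG_IOCSUNREG, unreg).
 */
static void fill_user_unreg(struct user_unreg *unreg,
                            const void *enable_addr, uint8_t enable_bit)
{
    assert(unreg != NULL);

    /* Zero everything first so both reserved fields are 0. */
    memset(unreg, 0, sizeof(*unreg));

    unreg->size = sizeof(*unreg);
    unreg->disable_bit = enable_bit;
    unreg->disable_addr = (uint64_t)(uintptr_t)enable_addr;
}
```

Zeroing the whole structure before setting the three inputs keeps the reserved
fields at 0, as the comments in the structure require.
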

Status
------
When tools attach/record user based events the status of the event is updated
in realtime. This allows user programs to only incur the cost of the write() or
writev() calls when something is actively attached to the event.

The kernel will update the specified bit that was registered for the event as
tools attach/detach from the event. User programs simply check if the bit is set
to see if something is attached or not.

Administrators can easily check the status of all registered events by reading
the user_events_status file directly via a terminal. The output is as follows::

  Name [# Comments]
  ...

  Active: ActiveCount
  Busy: BusyCount

For example, on a system that has a single event the output looks like this::

  test

  Active: 1
  Busy: 0

If a user enables the user event via ftrace, the output would change to this::

  test # Used by ftrace

  Active: 1
  Busy: 1

Writing Data
------------
After registering an event the same fd that was used to register can be used
to write an entry for that event.
The write_index returned must be at the start
of the data, then the remaining data is treated as the payload of the event.

For example, if write_index returned was 1 and I wanted to write out an int
payload of the event, then the data would have to be 8 bytes (2 ints) in size,
with the first 4 bytes being equal to 1 and the last 4 bytes being equal to the
value I want as the payload.

In memory this would look like this::

  int index;
  int payload;

User programs might have well known structs that they wish to use to emit out
as payloads. In those cases writev() can be used, with the first vector being
the index and the following vector(s) being the actual event payload.


For example, if I have a struct like this::

  struct payload {
        int src;
        int dst;
        int flags;
  } __attribute__((__packed__));

It's advised for user programs to do the following::

  struct iovec io[2];
  struct payload e;

  io[0].iov_base = &write_index;
  io[0].iov_len = sizeof(write_index);
  io[1].iov_base = &e;
  io[1].iov_len = sizeof(e);

  writev(fd, (const struct iovec*)io, 2);

**NOTE:** *The write_index is not emitted out into the trace being recorded.*

Example Code
------------
See sample code in samples/user_events.

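
To tie the Status and Writing Data sections together, the following is a
minimal sketch of two hypothetical helpers; event_enabled and pack_event are
illustrative names, not kernel or libc APIs. The first tests the registered bit
in the enable value, and the second lays out a contiguous buffer with the
write_index first and the payload after it, suitable for a single write() on
the fd used for registration. The sketch assumes a 32-bit enable value
(enable_size of 4) and the __u32 write_index from struct user_reg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return non-zero if the registered bit is currently set by the kernel. */
static int event_enabled(const volatile uint32_t *enable_value,
                         uint8_t enable_bit)
{
    assert(enable_bit < 32);
    return (*enable_value & (UINT32_C(1) << enable_bit)) != 0;
}

/*
 * Copy write_index followed by the payload into buf.
 * Returns the number of bytes to pass to write(), or 0 if buf is too small.
 */
static size_t pack_event(void *buf, size_t cap, uint32_t write_index,
                         const void *payload, size_t payload_len)
{
    assert(buf != NULL);

    /* Check capacity without risking overflow in the size arithmetic. */
    if (cap < sizeof(write_index) || payload_len > cap - sizeof(write_index))
        return 0;

    /* The index always comes first; the kernel strips it from the trace. */
    memcpy(buf, &write_index, sizeof(write_index));
    if (payload_len > 0)
        memcpy((char *)buf + sizeof(write_index), payload, payload_len);

    return sizeof(write_index) + payload_len;
}
```

A caller would typically check event_enabled() first and only then call
pack_event() and write(), so the cost of the system call is paid only while a
tool is attached to the event.
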