=========================================
user_events: User-based Event Tracing
=========================================

:Author: Beau Belgrave

Overview
--------
User-based trace events allow user processes to create events and trace data
that can be viewed via existing tools, such as ftrace and perf.
To enable this feature, build your kernel with CONFIG_USER_EVENTS=y.

Programs can view the status of the events via
/sys/kernel/tracing/user_events_status and can both register and write
data out via /sys/kernel/tracing/user_events_data.

Programs can also use /sys/kernel/tracing/dynamic_events to register and
delete user-based events via the u: prefix. The format of the command to
dynamic_events is the same as the ioctl with the u: prefix applied.

Typically programs will register a set of events that they wish to expose to
tools that can read trace_events (such as ftrace and perf). The registration
process tells the kernel which address and bit to update when a tool enables
the event, which signals that data should be written. Registration returns
a write index that identifies the event when write() or writev() is called
on the /sys/kernel/tracing/user_events_data file.

The structures referenced in this document are contained within the
/include/uapi/linux/user_events.h file in the source tree.

**NOTE:** *Both user_events_status and user_events_data are under the tracefs
filesystem and may be mounted at different paths than above.*
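
Because the mount point can differ, a program may prefer to look it up rather
than hard-code /sys/kernel/tracing. The following is a minimal sketch, assuming
/proc/mounts is readable and that the mount is listed with the filesystem type
"tracefs". The helper name find_tracefs() is hypothetical and is not part of
any kernel interface:

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy the first tracefs mount point found in
 * /proc/mounts into buf. Returns 0 on success, or -1 if no tracefs mount
 * is listed, /proc/mounts cannot be read, or buf is too small.
 */
static int find_tracefs(char *buf, size_t len)
{
	FILE *mounts;
	struct mntent *ent;
	int ret = -1;

	assert(buf != NULL && len > 0);

	mounts = setmntent("/proc/mounts", "r");
	if (!mounts)
		return -1;

	while ((ent = getmntent(mounts)) != NULL) {
		if (strcmp(ent->mnt_type, "tracefs") == 0) {
			int n = snprintf(buf, len, "%s", ent->mnt_dir);

			/* Only report success if the path was not truncated */
			if (n >= 0 && (size_t)n < len)
				ret = 0;
			break;
		}
	}

	endmntent(mounts);
	return ret;
}
```

A caller would then append /user_events_data or /user_events_status to the
returned directory before opening the file.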

Registering
-----------
Registering within a user process is done via ioctl() on the
/sys/kernel/tracing/user_events_data file. The command to issue is
DIAG_IOCSREG.

This command takes a packed struct user_reg as an argument::

  struct user_reg {
        /* Input: Size of the user_reg structure being used */
        __u32 size;

        /* Input: Bit in enable address to use */
        __u8 enable_bit;

        /* Input: Enable size in bytes at address */
        __u8 enable_size;

        /* Input: Flags for future use, set to 0 */
        __u16 flags;

        /* Input: Address to update when enabled */
        __u64 enable_addr;

        /* Input: Pointer to string with event name, description and flags */
        __u64 name_args;

        /* Output: Index of the event to use when writing data */
        __u32 write_index;
  } __attribute__((__packed__));

The struct user_reg requires all the above inputs to be set appropriately.

+ size: This must be set to sizeof(struct user_reg).

+ enable_bit: The bit to reflect the event status at the address specified by
  enable_addr.

+ enable_size: The size of the value specified by enable_addr.
  This must be 4 (32-bit) or 8 (64-bit). 64-bit values can only be used on
  64-bit kernels, whereas 32-bit values can be used on all kernels.

+ flags: The flags to use, if any. For the initial version this must be 0.
  Callers should first attempt to use flags and retry without flags to ensure
  support for older versions of the kernel. If a flag is not supported -EINVAL
  is returned.

+ enable_addr: The address of the value to use to reflect event status. This
  must be naturally aligned and write accessible within the user program.

+ name_args: The name and arguments to describe the event, see command format
  for details.

Upon successful registration the following is set.

+ write_index: The index to use for this file descriptor that represents this
  event when writing out data. The index is unique to this instance of the file
  descriptor that was used for the registration. See writing data for details.
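
The registration steps above can be sketched as follows. This is an
illustration under stated assumptions, not a reference implementation. The
struct is copied from this document, and the DIAG_IOC_MAGIC and DIAG_IOCSREG
definitions are reproduced here only so that the sketch compiles without the
uapi header; real programs should include linux/user_events.h instead. The
helper name register_event() is hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>

/* Local copy for illustration; normally provided by linux/user_events.h */
struct user_reg {
	__u32 size;
	__u8 enable_bit;
	__u8 enable_size;
	__u16 flags;
	__u64 enable_addr;
	__u64 name_args;
	__u32 write_index;
} __attribute__((__packed__));

static_assert(sizeof(struct user_reg) == 28, "user_reg must stay packed");

#define DIAG_IOC_MAGIC '*'
#define DIAG_IOCSREG _IOWR(DIAG_IOC_MAGIC, 0, struct user_reg *)

/*
 * Hypothetical helper: register name_args on fd and ask the kernel to
 * reflect the event status in bit 0 of *enable. On success the write
 * index is stored in *write_index and 0 is returned; on failure -1 is
 * returned with errno set by ioctl().
 */
static int register_event(int fd, const char *name_args, uint32_t *enable,
			  uint32_t *write_index)
{
	struct user_reg reg;

	memset(&reg, 0, sizeof(reg));
	reg.size = sizeof(reg);
	reg.enable_bit = 0;
	reg.enable_size = sizeof(*enable);	/* 4 bytes, valid on all kernels */
	reg.flags = 0;
	reg.enable_addr = (__u64)(uintptr_t)enable;
	reg.name_args = (__u64)(uintptr_t)name_args;

	if (ioctl(fd, DIAG_IOCSREG, &reg) == -1)
		return -1;

	*write_index = reg.write_index;
	return 0;
}
```

Here fd is expected to come from opening the user_events_data file, and a
name_args string such as "test u32 count" describes an event named test with
one u32 field. The enable value should live in memory that stays valid for as
long as the event remains registered, for example a static variable.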

User based events show up under tracefs like any other event under the
subsystem named "user_events". This means tools that wish to attach to the
events need to use /sys/kernel/tracing/events/user_events/[name]/enable
or perf record -e user_events:[name] when attaching/recording.

**NOTE:** The event subsystem name by default is "user_events". Callers should
not assume it will always be "user_events". Operators reserve the right in the
future to change the subsystem name per-process to accommodate event isolation.

Command Format
^^^^^^^^^^^^^^
The command string format is as follows::

  name[:FLAG1[,FLAG2...]] [Field1[;Field2...]]

Supported Flags
^^^^^^^^^^^^^^^
None yet

Field Format
^^^^^^^^^^^^
::

  type name [size]

Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc).
User programs are encouraged to use clearly sized types like u32.

**NOTE:** *Long is not supported since size can vary between user and kernel.*

The size is only valid for types that start with a struct prefix.
This allows user programs to describe custom structs out to tools, if required.

For example, a struct in C that looks like this::

  struct mytype {
    char data[20];
  };

Would be represented by the following field::

  struct mytype myname 20
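
To tie the command and field formats together, the sketch below builds a
command string for an event with one u32 field and one struct field. The event
name test, the field names, and the helper build_command() are made up for
illustration. Passing "u:" as the prefix yields the form used with
dynamic_events, while an empty prefix yields the string passed as name_args:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct mytype {
	char data[20];
};

/*
 * Hypothetical helper: format "<prefix>test u32 count;struct mytype myname 20"
 * into buf. The struct size is taken from sizeof() so the description stays
 * in sync with the C definition. Returns 0 on success, -1 on truncation.
 */
static int build_command(char *buf, size_t len, const char *prefix)
{
	int n;

	assert(buf != NULL && prefix != NULL);

	n = snprintf(buf, len, "%stest u32 count;struct mytype myname %zu",
		     prefix, sizeof(struct mytype));

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With an empty prefix this produces "test u32 count;struct mytype myname 20".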

Deleting
--------
Deleting an event from within a user process is done via ioctl() on the
/sys/kernel/tracing/user_events_data file. The command to issue is
DIAG_IOCSDEL.

This command only requires a single string specifying the event to delete by
its name. Delete will only succeed if there are no references left to the
event (in both user and kernel space). Because of this, user programs should
request deletes through a separate file from the one used for registration.
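
A delete request through a separately opened file might look like the sketch
below. The DIAG_IOC_MAGIC and DIAG_IOCSDEL definitions are reproduced only so
the sketch compiles without linux/user_events.h, which real programs should
include instead. The helper delete_event() and its data_path parameter are
hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Local copies for illustration; normally from linux/user_events.h */
#define DIAG_IOC_MAGIC '*'
#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char *)

/*
 * Hypothetical helper: open data_path (the user_events_data file) on a
 * fresh file descriptor, request deletion of the named event, and close
 * the descriptor again. Returns 0 on success, -1 with errno set otherwise.
 */
static int delete_event(const char *data_path, const char *name)
{
	int fd, ret, saved_errno;

	assert(data_path != NULL && name != NULL);

	fd = open(data_path, O_RDWR);
	if (fd == -1)
		return -1;

	ret = ioctl(fd, DIAG_IOCSDEL, name);

	/* Preserve the ioctl() error across close() */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;

	return ret;
}
```

The request is expected to fail while any other file descriptor or tool still
holds a reference to the event.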

Unregistering
-------------
If a process no longer wants updates for an event it has registered, it can
unregister the event via ioctl() on the /sys/kernel/tracing/user_events_data
file. The command to issue is DIAG_IOCSUNREG. This is different from deleting,
which actually removes the event from the system. Unregistering simply tells
the kernel your process is no longer interested in updates to the event.

This command takes a packed struct user_unreg as an argument::

  struct user_unreg {
        /* Input: Size of the user_unreg structure being used */
        __u32 size;

        /* Input: Bit to unregister */
        __u8 disable_bit;

        /* Input: Reserved, set to 0 */
        __u8 __reserved;

        /* Input: Reserved, set to 0 */
        __u16 __reserved2;

        /* Input: Address to unregister */
        __u64 disable_addr;
  } __attribute__((__packed__));

The struct user_unreg requires all the above inputs to be set appropriately.

+ size: This must be set to sizeof(struct user_unreg).

+ disable_bit: This must be set to the bit to disable (same bit that was
  previously registered via enable_bit).

+ disable_addr: This must be set to the address to disable (same address that was
  previously registered via enable_addr).

**NOTE:** Events are automatically unregistered when execve() is invoked. During
fork() the registered events are retained and must be unregistered manually
in each process if desired.
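
As a rough sketch of the unregister request, assuming the same address and bit
that were used at registration: the struct is copied from this document and the
ioctl definitions are reproduced only so the sketch compiles without
linux/user_events.h, which real programs should include instead. The helper
name unregister_event() is hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>

/* Local copy for illustration; normally provided by linux/user_events.h */
struct user_unreg {
	__u32 size;
	__u8 disable_bit;
	__u8 __reserved;
	__u16 __reserved2;
	__u64 disable_addr;
} __attribute__((__packed__));

static_assert(sizeof(struct user_unreg) == 16, "user_unreg must stay packed");

#define DIAG_IOC_MAGIC '*'
#define DIAG_IOCSUNREG _IOW(DIAG_IOC_MAGIC, 2, struct user_unreg *)

/*
 * Hypothetical helper: stop status updates for the given bit of *enable.
 * Both values must match what was passed at registration time.
 * Returns 0 on success, -1 with errno set by ioctl() otherwise.
 */
static int unregister_event(int fd, uint32_t *enable, uint8_t bit)
{
	struct user_unreg unreg;

	/* memset() also zeroes the reserved fields, as required */
	memset(&unreg, 0, sizeof(unreg));
	unreg.size = sizeof(unreg);
	unreg.disable_bit = bit;
	unreg.disable_addr = (__u64)(uintptr_t)enable;

	return ioctl(fd, DIAG_IOCSUNREG, &unreg);
}
```

A process that forks could call such a helper in the child for each inherited
registration it does not want to keep.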

Status
------
When tools attach/record user based events the status of the event is updated
in real time. This allows user programs to only incur the cost of the write() or
writev() calls when something is actively attached to the event.

The kernel will update the specified bit that was registered for the event as
tools attach/detach from the event. User programs simply check if the bit is set
to see if something is attached or not.
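
The bit check can be as small as the sketch below, which assumes a 32-bit
enable value (enable_size of 4). The helper name event_enabled() is
hypothetical, and reading through a volatile pointer is a choice made here so
that each call reloads the value the kernel may have changed, not something
this document mandates:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: report whether the registered bit is currently
 * set in the 32-bit enable value, meaning a tool is attached.
 */
static inline bool event_enabled(const volatile uint32_t *enable,
				 unsigned int bit)
{
	assert(bit < 32);

	return (*enable & (UINT32_C(1) << bit)) != 0;
}
```

A program would typically wrap its write() or writev() call in a check like
this so that the system call is skipped while nothing is attached.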

Administrators can easily check the status of all registered events by reading
the user_events_status file directly via a terminal. The output is as follows::

  Name [# Comments]
  ...

  Active: ActiveCount
  Busy: BusyCount

For example, on a system that has a single event the output looks like this::

  test

  Active: 1
  Busy: 0

If a user enables the user event via ftrace, the output would change to this::

  test # Used by ftrace

  Active: 1
  Busy: 1

Writing Data
------------
After registering an event the same fd that was used to register can be used
to write an entry for that event. The write_index returned must be at the start
of the data, then the remaining data is treated as the payload of the event.

For example, if the write_index returned was 1 and the payload of the event is
a single int, the data would have to be 8 bytes (2 ints) in size, with the
first 4 bytes being equal to 1 and the last 4 bytes being equal to the payload
value.

In memory this would look like this::

  int index;
  int payload;
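
A minimal sketch of such a write, assuming a 4-byte int and a write_index
obtained from an earlier registration on the same fd. The helper name
write_int_event() is hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static_assert(sizeof(int) == 4, "sketch assumes a 4-byte int payload");

/*
 * Hypothetical helper: emit one entry made of the 4-byte write index
 * followed by a 4-byte int payload, using a single write() call.
 * Returns the number of bytes written (8 on success) or -1 on error.
 */
static ssize_t write_int_event(int fd, uint32_t write_index, int payload)
{
	unsigned char buf[sizeof(write_index) + sizeof(payload)];

	/* Index first, payload immediately after, with no padding */
	memcpy(buf, &write_index, sizeof(write_index));
	memcpy(buf + sizeof(write_index), &payload, sizeof(payload));

	return write(fd, buf, sizeof(buf));
}
```

Building the buffer with memcpy() avoids relying on struct padding rules. The
index and payload must reach the kernel in the same write() call.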

User programs might have well-known structs that they wish to emit as
payloads. In those cases writev() can be used, with the first vector being
the index and the following vector(s) being the actual event payload.

For example, if I have a struct like this::

  struct payload {
        int src;
        int dst;
        int flags;
  } __attribute__((__packed__));

It's advised for user programs to do the following::

  struct iovec io[2];
  struct payload e;

  /* Fill in e.src, e.dst and e.flags before writing */

  /* First vector: the write_index returned by registration */
  io[0].iov_base = &write_index;
  io[0].iov_len = sizeof(write_index);

  /* Second vector: the event payload */
  io[1].iov_base = &e;
  io[1].iov_len = sizeof(e);

  writev(fd, io, 2);

**NOTE:** *The write_index is not emitted out into the trace being recorded.*

Example Code
------------
See sample code in samples/user_events.