1============= 2BPF Iterators 3============= 4 5 6---------- 7Motivation 8---------- 9 10There are a few existing ways to dump kernel data into user space. The most 11popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps 12all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink 13sockets in the system. However, their output format tends to be fixed, and if 14users want more information about these sockets, they have to patch the kernel, 15which often takes time to publish upstream and release. The same is true for popular 16tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any 17additional information needs a kernel patch. 18 19To solve this problem, the `drgn 20<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to 21dig out the kernel data with no kernel change. However, the main drawback for 22drgn is performance, as it cannot do pointer tracing inside the kernel. In 23addition, drgn cannot validate a pointer value and may read invalid data if the 24pointer becomes invalid inside the kernel. 25 26The BPF iterator solves the above problem by providing flexibility on what data 27(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel 28data object. 29 30---------------------- 31How BPF Iterators Work 32---------------------- 33 34A BPF iterator is a type of BPF program that allows users to iterate over 35specific types of kernel objects. Unlike traditional BPF tracing programs that 36allow users to define callbacks that are invoked at particular points of 37execution in the kernel, BPF iterators allow users to define callbacks that 38should be executed for every entry in a variety of kernel data structures. 39 40For example, users can define a BPF iterator that iterates over every task on 41the system and dumps the total amount of CPU runtime currently used by each of 42them. Another BPF task iterator may instead dump the cgroup information for each 43task. Such flexibility is the core value of BPF iterators. 44 45A BPF program is always loaded into the kernel at the behest of a user space 46process. A user space process loads a BPF program by opening and initializing 47the program skeleton as required and then invoking a syscall to have the BPF 48program verified and loaded by the kernel. 49 50In traditional tracing programs, a program is activated by having user space 51obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once 52activated, the program callback will be invoked whenever the tracepoint is 53triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the 54program is obtained using ``bpf_link_create()``, and the program callback is 55invoked by issuing system calls from user space. 56 57Next, let us see how you can use the iterators to iterate on kernel objects and 58read data. 59 60------------------------ 61How to Use BPF iterators 62------------------------ 63 64BPF selftests are a great resource to illustrate how to use the iterators. In 65this section, we’ll walk through a BPF selftest which shows how to load and use 66a BPF iterator program. To begin, we’ll look at `bpf_iter.c 67<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_, 68which illustrates how to load and trigger BPF iterators on the user space side. 69Later, we’ll look at a BPF program that runs in kernel space. 70 71Loading a BPF iterator in the kernel from user space typically involves the 72following steps: 73 74* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel 75 has verified and loaded the program, it returns a file descriptor (fd) to user 76 space. 77* Obtain a ``link_fd`` to the BPF program by calling the ``bpf_link_create()`` 78 specified with the BPF program file descriptor received from the kernel. 79* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling the 80 ``bpf_iter_create()`` specified with the ``bpf_link`` received from Step 2. 81* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is 82 available. 83* Close the iterator fd using ``close(bpf_iter_fd)``. 84* If needed to reread the data, get a new ``bpf_iter_fd`` and do the read again. 85 86The following are a few examples of selftest BPF iterator programs: 87 88* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_ 89* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_ 90* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_ 91 92Let us look at ``bpf_iter_task_file.c``, which runs in kernel space: 93 94Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h 95<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_. 96Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>`` 97represents a BPF iterator. The suffix ``<iter_name>`` represents the type of 98iterator. 99 100:: 101 102 struct bpf_iter__task_file { 103 union { 104 struct bpf_iter_meta *meta; 105 }; 106 union { 107 struct task_struct *task; 108 }; 109 u32 fd; 110 union { 111 struct file *file; 112 }; 113 }; 114 115In the above code, the field 'meta' contains the metadata, which is the same for 116all BPF iterator programs. The rest of the fields are specific to different 117iterators. For example, for task_file iterators, the kernel layer provides the 118'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference 119counted 120<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_, 121so they won't go away when the BPF program runs. 122 123Here is a snippet from the ``bpf_iter_task_file.c`` file: 124 125:: 126 127 SEC("iter/task_file") 128 int dump_task_file(struct bpf_iter__task_file *ctx) 129 { 130 struct seq_file *seq = ctx->meta->seq; 131 struct task_struct *task = ctx->task; 132 struct file *file = ctx->file; 133 __u32 fd = ctx->fd; 134 135 if (task == NULL || file == NULL) 136 return 0; 137 138 if (ctx->meta->seq_num == 0) { 139 count = 0; 140 BPF_SEQ_PRINTF(seq, " tgid gid fd file\n"); 141 } 142 143 if (tgid == task->tgid && task->tgid != task->pid) 144 count++; 145 146 if (last_tgid != task->tgid) { 147 last_tgid = task->tgid; 148 unique_tgid_count++; 149 } 150 151 BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd, 152 (long)file->f_op); 153 return 0; 154 } 155 156In the above example, the section name ``SEC(iter/task_file)``, indicates that 157the program is a BPF iterator program to iterate all files from all tasks. The 158context of the program is ``bpf_iter__task_file`` struct. 159 160The user space program invokes the BPF iterator program running in the kernel 161by issuing a ``read()`` syscall. Once invoked, the BPF 162program can export data to user space using a variety of BPF helper functions. 163You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or 164``bpf_seq_write()`` function based on whether you need formatted output or just 165binary data, respectively. For binary-encoded data, the user space applications 166can process the data from ``bpf_seq_write()`` as needed. For the formatted data, 167you can use ``cat <path>`` to print the results similar to ``cat 168/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later, 169use ``rm -f <path>`` to remove the pinned iterator. 170 171For example, you can use the following command to create a BPF iterator from the 172``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route`` 173path: 174 175:: 176 177 $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route 178 179And then print out the results using the following command: 180 181:: 182 183 $ cat /sys/fs/bpf/my_route 184 185 186------------------------------------------------------- 187Implement Kernel Support for BPF Iterator Program Types 188------------------------------------------------------- 189 190To implement a BPF iterator in the kernel, the developer must make a one-time 191change to the following key data structure defined in the `bpf.h 192<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_ 193file. 194 195:: 196 197 struct bpf_iter_reg { 198 const char *target; 199 bpf_iter_attach_target_t attach_target; 200 bpf_iter_detach_target_t detach_target; 201 bpf_iter_show_fdinfo_t show_fdinfo; 202 bpf_iter_fill_link_info_t fill_link_info; 203 bpf_iter_get_func_proto_t get_func_proto; 204 u32 ctx_arg_info_size; 205 u32 feature; 206 struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX]; 207 const struct bpf_iter_seq_info *seq_info; 208 }; 209 210After filling the data structure fields, call ``bpf_iter_reg_target()`` to 211register the iterator to the main BPF iterator subsystem. 212 213The following is the breakdown for each field in struct ``bpf_iter_reg``. 214 215.. list-table:: 216 :widths: 25 50 217 :header-rows: 1 218 219 * - Fields 220 - Description 221 * - target 222 - Specifies the name of the BPF iterator. For example: ``bpf_map``, 223 ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel. 224 * - attach_target and detach_target 225 - Allows for target specific ``link_create`` action since some targets 226 may need special processing. Called during the user space link_create stage. 227 * - show_fdinfo and fill_link_info 228 - Called to fill target specific information when user tries to get link 229 info associated with the iterator. 230 * - get_func_proto 231 - Permits a BPF iterator to access BPF helpers specific to the iterator. 232 * - ctx_arg_info_size and ctx_arg_info 233 - Specifies the verifier states for BPF program arguments associated with 234 the bpf iterator. 235 * - feature 236 - Specifies certain action requests in the kernel BPF iterator 237 infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means 238 that the kernel function cond_resched() is called to avoid other kernel 239 subsystem (e.g., rcu) misbehaving. 240 * - seq_info 241 - Specifies certain action requests in the kernel BPF iterator 242 infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means 243 that the kernel function cond_resched() is called to avoid other kernel 244 subsystem (e.g., rcu) misbehaving. 245 246 247`Click here 248<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_ 249to see an implementation of the ``task_vma`` BPF iterator in the kernel. 250 251--------------------------------- 252Parameterizing BPF Task Iterators 253--------------------------------- 254 255By default, BPF iterators walk through all the objects of the specified types 256(processes, cgroups, maps, etc.) across the entire system to read relevant 257kernel data. But often, there are cases where we only care about a much smaller 258subset of iterable kernel objects, such as only iterating tasks within a 259specific process. Therefore, BPF iterator programs support filtering out objects 260from iteration by allowing user space to configure the iterator program when it 261is attached. 262 263-------------------------- 264BPF Task Iterator Program 265-------------------------- 266 267The following code is a BPF iterator program to print files and task information 268through the ``seq_file`` of the iterator. It is a standard BPF iterator program 269that visits every file of an iterator. We will use this BPF program in our 270example later. 271 272:: 273 274 #include <vmlinux.h> 275 #include <bpf/bpf_helpers.h> 276 277 char _license[] SEC("license") = "GPL"; 278 279 SEC("iter/task_file") 280 int dump_task_file(struct bpf_iter__task_file *ctx) 281 { 282 struct seq_file *seq = ctx->meta->seq; 283 struct task_struct *task = ctx->task; 284 struct file *file = ctx->file; 285 __u32 fd = ctx->fd; 286 if (task == NULL || file == NULL) 287 return 0; 288 if (ctx->meta->seq_num == 0) { 289 BPF_SEQ_PRINTF(seq, " tgid pid fd file\n"); 290 } 291 BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd, 292 (long)file->f_op); 293 return 0; 294 } 295 296---------------------------------------- 297Creating a File Iterator with Parameters 298---------------------------------------- 299 300Now, let us look at how to create an iterator that includes only files of a 301process. 302 303First, fill the ``bpf_iter_attach_opts`` struct as shown below: 304 305:: 306 307 LIBBPF_OPTS(bpf_iter_attach_opts, opts); 308 union bpf_iter_link_info linfo; 309 memset(&linfo, 0, sizeof(linfo)); 310 linfo.task.pid = getpid(); 311 opts.link_info = &linfo; 312 opts.link_info_len = sizeof(linfo); 313 314``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator 315that only includes opened files for the process with the specified ``pid``. In 316this example, we will only be iterating files for our process. If 317``linfo.task.pid`` is zero, the iterator will visit every opened file of every 318process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator 319that visits opened files of a specific thread, not a process. In this example, 320``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a 321separate file descriptor table. In most circumstances, all process threads share 322a single file descriptor table. 323 324Now, in the userspace program, pass the pointer of struct to the 325``bpf_program__attach_iter()``. 326 327:: 328 329 link = bpf_program__attach_iter(prog, &opts); iter_fd = 330 bpf_iter_create(bpf_link__fd(link)); 331 332If both *tid* and *pid* are zero, an iterator created from this struct 333``bpf_iter_attach_opts`` will include every opened file of every task in the 334system (in the namespace, actually.) It is the same as passing a NULL as the 335second argument to ``bpf_program__attach_iter()``. 336 337The whole program looks like the following code: 338 339:: 340 341 #include <stdio.h> 342 #include <unistd.h> 343 #include <bpf/bpf.h> 344 #include <bpf/libbpf.h> 345 #include "bpf_iter_task_ex.skel.h" 346 347 static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts) 348 { 349 struct bpf_link *link; 350 char buf[16] = {}; 351 int iter_fd = -1, len; 352 int ret = 0; 353 354 link = bpf_program__attach_iter(prog, opts); 355 if (!link) { 356 fprintf(stderr, "bpf_program__attach_iter() fails\n"); 357 return -1; 358 } 359 iter_fd = bpf_iter_create(bpf_link__fd(link)); 360 if (iter_fd < 0) { 361 fprintf(stderr, "bpf_iter_create() fails\n"); 362 ret = -1; 363 goto free_link; 364 } 365 /* not check contents, but ensure read() ends without error */ 366 while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) { 367 buf[len] = 0; 368 printf("%s", buf); 369 } 370 printf("\n"); 371 free_link: 372 if (iter_fd >= 0) 373 close(iter_fd); 374 bpf_link__destroy(link); 375 return 0; 376 } 377 378 static void test_task_file(void) 379 { 380 LIBBPF_OPTS(bpf_iter_attach_opts, opts); 381 struct bpf_iter_task_ex *skel; 382 union bpf_iter_link_info linfo; 383 skel = bpf_iter_task_ex__open_and_load(); 384 if (skel == NULL) 385 return; 386 memset(&linfo, 0, sizeof(linfo)); 387 linfo.task.pid = getpid(); 388 opts.link_info = &linfo; 389 opts.link_info_len = sizeof(linfo); 390 printf("PID %d\n", getpid()); 391 do_read_opts(skel->progs.dump_task_file, &opts); 392 bpf_iter_task_ex__destroy(skel); 393 } 394 395 int main(int argc, const char * const * argv) 396 { 397 test_task_file(); 398 return 0; 399 } 400 401The following lines are the output of the program. 402:: 403 404 PID 1859 405 406 tgid pid fd file 407 1859 1859 0 ffffffff82270aa0 408 1859 1859 1 ffffffff82270aa0 409 1859 1859 2 ffffffff82270aa0 410 1859 1859 3 ffffffff82272980 411 1859 1859 4 ffffffff8225e120 412 1859 1859 5 ffffffff82255120 413 1859 1859 6 ffffffff82254f00 414 1859 1859 7 ffffffff82254d80 415 1859 1859 8 ffffffff8225abe0 416 417------------------ 418Without Parameters 419------------------ 420 421Let us look at how a BPF iterator without parameters skips files of other 422processes in the system. In this case, the BPF program has to check the pid or 423the tid of tasks, or it will receive every opened file in the system (in the 424current *pid* namespace, actually). So, we usually add a global variable in the 425BPF program to pass a *pid* to the BPF program. 426 427The BPF program would look like the following block. 428 429 :: 430 431 ...... 432 int target_pid = 0; 433 434 SEC("iter/task_file") 435 int dump_task_file(struct bpf_iter__task_file *ctx) 436 { 437 ...... 438 if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */ 439 return 0; 440 BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd, 441 (long)file->f_op); 442 return 0; 443 } 444 445The user space program would look like the following block: 446 447 :: 448 449 ...... 450 static void test_task_file(void) 451 { 452 ...... 453 skel = bpf_iter_task_ex__open_and_load(); 454 if (skel == NULL) 455 return; 456 skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */ 457 memset(&linfo, 0, sizeof(linfo)); 458 linfo.task.pid = getpid(); 459 opts.link_info = &linfo; 460 opts.link_info_len = sizeof(linfo); 461 ...... 462 } 463 464``target_pid`` is a global variable in the BPF program. The user space program 465should initialize the variable with a process ID to skip opened files of other 466processes in the BPF program. When you parametrize a BPF iterator, the iterator 467calls the BPF program fewer times which can save significant resources. 468 469--------------------------- 470Parametrizing VMA Iterators 471--------------------------- 472 473By default, a BPF VMA iterator includes every VMA in every process. However, 474you can still specify a process or a thread to include only its VMAs. Unlike 475files, a thread can not have a separate address space (since Linux 2.6.0-test6). 476Here, using *tid* makes no difference from using *pid*. 477 478---------------------------- 479Parametrizing Task Iterators 480---------------------------- 481 482A BPF task iterator with *pid* includes all tasks (threads) of a process. The 483BPF program receives these tasks one after another. You can specify a BPF task 484iterator with *tid* parameter to include only the tasks that match the given 485*tid*. 486