1=============
2BPF Iterators
3=============
4
5
6----------
7Motivation
8----------
9
10There are a few existing ways to dump kernel data into user space. The most
11popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
12all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
13sockets in the system. However, their output format tends to be fixed, and if
14users want more information about these sockets, they have to patch the kernel,
15which often takes time to publish upstream and release. The same is true for popular
16tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
17additional information needs a kernel patch.
18
19To solve this problem, the `drgn
20<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
21dig out the kernel data with no kernel change. However, the main drawback for
22drgn is performance, as it cannot do pointer tracing inside the kernel. In
23addition, drgn cannot validate a pointer value and may read invalid data if the
24pointer becomes invalid inside the kernel.
25
26The BPF iterator solves the above problem by providing flexibility on what data
27(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
28data object.
29
30----------------------
31How BPF Iterators Work
32----------------------
33
34A BPF iterator is a type of BPF program that allows users to iterate over
35specific types of kernel objects. Unlike traditional BPF tracing programs that
36allow users to define callbacks that are invoked at particular points of
37execution in the kernel, BPF iterators allow users to define callbacks that
38should be executed for every entry in a variety of kernel data structures.
39
40For example, users can define a BPF iterator that iterates over every task on
41the system and dumps the total amount of CPU runtime currently used by each of
42them. Another BPF task iterator may instead dump the cgroup information for each
43task. Such flexibility is the core value of BPF iterators.
44
45A BPF program is always loaded into the kernel at the behest of a user space
46process. A user space process loads a BPF program by opening and initializing
47the program skeleton as required and then invoking a syscall to have the BPF
48program verified and loaded by the kernel.
49
50In traditional tracing programs, a program is activated by having user space
51obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
52activated, the program callback will be invoked whenever the tracepoint is
53triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
54program is obtained using ``bpf_link_create()``, and the program callback is
55invoked by issuing system calls from user space.
56
57Next, let us see how you can use the iterators to iterate on kernel objects and
58read data.
59
60------------------------
61How to Use BPF iterators
62------------------------
63
64BPF selftests are a great resource to illustrate how to use the iterators. In
65this section, we’ll walk through a BPF selftest which shows how to load and use
66a BPF iterator program.   To begin, we’ll look at `bpf_iter.c
67<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
68which illustrates how to load and trigger BPF iterators on the user space side.
69Later, we’ll look at a BPF program that runs in kernel space.
70
71Loading a BPF iterator in the kernel from user space typically involves the
72following steps:
73
74* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
75  has verified and loaded the program, it returns a file descriptor (fd) to user
76  space.
77* Obtain a ``link_fd`` to the BPF program by calling the ``bpf_link_create()``
78  specified with the BPF program file descriptor received from the kernel.
79* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling the
80  ``bpf_iter_create()`` specified with the ``bpf_link`` received from Step 2.
81* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
82  available.
83* Close the iterator fd using ``close(bpf_iter_fd)``.
84* If needed to reread the data, get a new ``bpf_iter_fd`` and do the read again.
85
86The following are a few examples of selftest BPF iterator programs:
87
88* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
89* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
90* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
91
92Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
93
94Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
95<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
96Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
97represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
98iterator.
99
100::
101
102    struct bpf_iter__task_file {
103            union {
104                struct bpf_iter_meta *meta;
105            };
106            union {
107                struct task_struct *task;
108            };
109            u32 fd;
110            union {
111                struct file *file;
112            };
113    };
114
115In the above code, the field 'meta' contains the metadata, which is the same for
116all BPF iterator programs. The rest of the fields are specific to different
117iterators. For example, for task_file iterators, the kernel layer provides the
118'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
119counted
120<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
121so they won't go away when the BPF program runs.
122
123Here is a snippet from the  ``bpf_iter_task_file.c`` file:
124
125::
126
127  SEC("iter/task_file")
128  int dump_task_file(struct bpf_iter__task_file *ctx)
129  {
130    struct seq_file *seq = ctx->meta->seq;
131    struct task_struct *task = ctx->task;
132    struct file *file = ctx->file;
133    __u32 fd = ctx->fd;
134
135    if (task == NULL || file == NULL)
136      return 0;
137
138    if (ctx->meta->seq_num == 0) {
139      count = 0;
140      BPF_SEQ_PRINTF(seq, "    tgid      gid       fd      file\n");
141    }
142
143    if (tgid == task->tgid && task->tgid != task->pid)
144      count++;
145
146    if (last_tgid != task->tgid) {
147      last_tgid = task->tgid;
148      unique_tgid_count++;
149    }
150
151    BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
152            (long)file->f_op);
153    return 0;
154  }
155
156In the above example, the section name ``SEC(iter/task_file)``, indicates that
157the program is a BPF iterator program to iterate all files from all tasks. The
158context of the program is ``bpf_iter__task_file`` struct.
159
160The user space program invokes the BPF iterator program running in the kernel
161by issuing a ``read()`` syscall. Once invoked, the BPF
162program can export data to user space using a variety of BPF helper functions.
163You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or
164``bpf_seq_write()`` function based on whether you need formatted output or just
165binary data, respectively. For binary-encoded data, the user space applications
166can process the data from ``bpf_seq_write()`` as needed. For the formatted data,
167you can use ``cat <path>`` to print the results similar to ``cat
168/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
169use  ``rm -f <path>`` to remove the pinned iterator.
170
171For example, you can use the following command to create a BPF iterator from the
172``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
173path:
174
175::
176
177  $ bpftool iter pin ./bpf_iter_ipv6_route.o  /sys/fs/bpf/my_route
178
179And then print out the results using the following command:
180
181::
182
183  $ cat /sys/fs/bpf/my_route
184
185
186-------------------------------------------------------
187Implement Kernel Support for BPF Iterator Program Types
188-------------------------------------------------------
189
190To implement a BPF iterator in the kernel, the developer must make a one-time
191change to the following key data structure defined in the `bpf.h
192<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
193file.
194
195::
196
197  struct bpf_iter_reg {
198            const char *target;
199            bpf_iter_attach_target_t attach_target;
200            bpf_iter_detach_target_t detach_target;
201            bpf_iter_show_fdinfo_t show_fdinfo;
202            bpf_iter_fill_link_info_t fill_link_info;
203            bpf_iter_get_func_proto_t get_func_proto;
204            u32 ctx_arg_info_size;
205            u32 feature;
206            struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
207            const struct bpf_iter_seq_info *seq_info;
208  };
209
210After filling the data structure fields, call ``bpf_iter_reg_target()`` to
211register the iterator to the main BPF iterator subsystem.
212
213The following is the breakdown for each field in struct ``bpf_iter_reg``.
214
215.. list-table::
216   :widths: 25 50
217   :header-rows: 1
218
219   * - Fields
220     - Description
221   * - target
222     - Specifies the name of the BPF iterator. For example: ``bpf_map``,
223       ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
224   * - attach_target and detach_target
225     - Allows for target specific ``link_create`` action since some targets
226       may need special processing. Called during the user space link_create stage.
227   * - show_fdinfo and fill_link_info
228     - Called to fill target specific information when user tries to get link
229       info associated with the iterator.
230   * - get_func_proto
231     - Permits a BPF iterator to access BPF helpers specific to the iterator.
232   * - ctx_arg_info_size and ctx_arg_info
233     - Specifies the verifier states for BPF program arguments associated with
234       the bpf iterator.
235   * - feature
236     - Specifies certain action requests in the kernel BPF iterator
237       infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
238       that the kernel function cond_resched() is called to avoid other kernel
239       subsystem (e.g., rcu) misbehaving.
240   * - seq_info
241     - Specifies the set of seq operations for the BPF iterator and helpers to
242       initialize/free the private data for the corresponding ``seq_file``.
243
244`Click here
245<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
246to see an implementation of the ``task_vma`` BPF iterator in the kernel.
247
248---------------------------------
249Parameterizing BPF Task Iterators
250---------------------------------
251
252By default, BPF iterators walk through all the objects of the specified types
253(processes, cgroups, maps, etc.) across the entire system to read relevant
254kernel data. But often, there are cases where we only care about a much smaller
255subset of iterable kernel objects, such as only iterating tasks within a
256specific process. Therefore, BPF iterator programs support filtering out objects
257from iteration by allowing user space to configure the iterator program when it
258is attached.
259
260--------------------------
261BPF Task Iterator Program
262--------------------------
263
264The following code is a BPF iterator program to print files and task information
265through the ``seq_file`` of the iterator. It is a standard BPF iterator program
266that visits every file of an iterator. We will use this BPF program in our
267example later.
268
269::
270
271  #include <vmlinux.h>
272  #include <bpf/bpf_helpers.h>
273
274  char _license[] SEC("license") = "GPL";
275
276  SEC("iter/task_file")
277  int dump_task_file(struct bpf_iter__task_file *ctx)
278  {
279        struct seq_file *seq = ctx->meta->seq;
280        struct task_struct *task = ctx->task;
281        struct file *file = ctx->file;
282        __u32 fd = ctx->fd;
283        if (task == NULL || file == NULL)
284                return 0;
285        if (ctx->meta->seq_num == 0) {
286                BPF_SEQ_PRINTF(seq, "    tgid      pid       fd      file\n");
287        }
288        BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
289                        (long)file->f_op);
290        return 0;
291  }
292
293----------------------------------------
294Creating a File Iterator with Parameters
295----------------------------------------
296
297Now, let us look at how to create an iterator that includes only files of a
298process.
299
300First,  fill the ``bpf_iter_attach_opts`` struct as shown below:
301
302::
303
304  LIBBPF_OPTS(bpf_iter_attach_opts, opts);
305  union bpf_iter_link_info linfo;
306  memset(&linfo, 0, sizeof(linfo));
307  linfo.task.pid = getpid();
308  opts.link_info = &linfo;
309  opts.link_info_len = sizeof(linfo);
310
311``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
312that only includes opened files for the process with the specified ``pid``. In
313this example, we will only be iterating files for our process. If
314``linfo.task.pid`` is zero, the iterator will visit every opened file of every
315process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
316that visits opened files of a specific thread, not a process. In this example,
317``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a
318separate file descriptor table. In most circumstances, all process threads share
319a single file descriptor table.
320
321Now, in the userspace program, pass the pointer of struct to the
322``bpf_program__attach_iter()``.
323
324::
325
326  link = bpf_program__attach_iter(prog, &opts); iter_fd =
327  bpf_iter_create(bpf_link__fd(link));
328
329If both *tid* and *pid* are zero, an iterator created from this struct
330``bpf_iter_attach_opts`` will include every opened file of every task in the
331system (in the namespace, actually.) It is the same as passing a NULL as the
332second argument to ``bpf_program__attach_iter()``.
333
334The whole program looks like the following code:
335
336::
337
338  #include <stdio.h>
339  #include <unistd.h>
340  #include <bpf/bpf.h>
341  #include <bpf/libbpf.h>
342  #include "bpf_iter_task_ex.skel.h"
343
344  static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
345  {
346        struct bpf_link *link;
347        char buf[16] = {};
348        int iter_fd = -1, len;
349        int ret = 0;
350
351        link = bpf_program__attach_iter(prog, opts);
352        if (!link) {
353                fprintf(stderr, "bpf_program__attach_iter() fails\n");
354                return -1;
355        }
356        iter_fd = bpf_iter_create(bpf_link__fd(link));
357        if (iter_fd < 0) {
358                fprintf(stderr, "bpf_iter_create() fails\n");
359                ret = -1;
360                goto free_link;
361        }
362        /* not check contents, but ensure read() ends without error */
363        while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
364                buf[len] = 0;
365                printf("%s", buf);
366        }
367        printf("\n");
368  free_link:
369        if (iter_fd >= 0)
370                close(iter_fd);
371        bpf_link__destroy(link);
372        return 0;
373  }
374
375  static void test_task_file(void)
376  {
377        LIBBPF_OPTS(bpf_iter_attach_opts, opts);
378        struct bpf_iter_task_ex *skel;
379        union bpf_iter_link_info linfo;
380        skel = bpf_iter_task_ex__open_and_load();
381        if (skel == NULL)
382                return;
383        memset(&linfo, 0, sizeof(linfo));
384        linfo.task.pid = getpid();
385        opts.link_info = &linfo;
386        opts.link_info_len = sizeof(linfo);
387        printf("PID %d\n", getpid());
388        do_read_opts(skel->progs.dump_task_file, &opts);
389        bpf_iter_task_ex__destroy(skel);
390  }
391
392  int main(int argc, const char * const * argv)
393  {
394        test_task_file();
395        return 0;
396  }
397
398The following lines are the output of the program.
399::
400
401  PID 1859
402
403     tgid      pid       fd      file
404     1859     1859        0 ffffffff82270aa0
405     1859     1859        1 ffffffff82270aa0
406     1859     1859        2 ffffffff82270aa0
407     1859     1859        3 ffffffff82272980
408     1859     1859        4 ffffffff8225e120
409     1859     1859        5 ffffffff82255120
410     1859     1859        6 ffffffff82254f00
411     1859     1859        7 ffffffff82254d80
412     1859     1859        8 ffffffff8225abe0
413
414------------------
415Without Parameters
416------------------
417
418Let us look at how a BPF iterator without parameters skips files of other
419processes in the system. In this case, the BPF program has to check the pid or
420the tid of tasks, or it will receive every opened file in the system (in the
421current *pid* namespace, actually). So, we usually add a global variable in the
422BPF program to pass a *pid* to the BPF program.
423
424The BPF program would look like the following block.
425
426  ::
427
428    ......
429    int target_pid = 0;
430
431    SEC("iter/task_file")
432    int dump_task_file(struct bpf_iter__task_file *ctx)
433    {
434          ......
435          if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
436                  return 0;
437          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
438                          (long)file->f_op);
439          return 0;
440    }
441
442The user space program would look like the following block:
443
444  ::
445
446    ......
447    static void test_task_file(void)
448    {
449          ......
450          skel = bpf_iter_task_ex__open_and_load();
451          if (skel == NULL)
452                  return;
453          skel->bss->target_pid = getpid(); /* process ID.  For thread id, use gettid() */
454          memset(&linfo, 0, sizeof(linfo));
455          linfo.task.pid = getpid();
456          opts.link_info = &linfo;
457          opts.link_info_len = sizeof(linfo);
458          ......
459    }
460
461``target_pid`` is a global variable in the BPF program. The user space program
462should initialize the variable with a process ID to skip opened files of other
463processes in the BPF program. When you parametrize a BPF iterator, the iterator
464calls the BPF program fewer times which can save significant resources.
465
466---------------------------
467Parametrizing VMA Iterators
468---------------------------
469
470By default, a BPF VMA iterator includes every VMA in every process.  However,
471you can still specify a process or a thread to include only its VMAs. Unlike
472files, a thread can not have a separate address space (since Linux 2.6.0-test6).
473Here, using *tid* makes no difference from using *pid*.
474
475----------------------------
476Parametrizing Task Iterators
477----------------------------
478
479A BPF task iterator with *pid* includes all tasks (threads) of a process. The
480BPF program receives these tasks one after another. You can specify a BPF task
481iterator with *tid* parameter to include only the tasks that match the given
482*tid*.
483