=============================
BPF Kernel Functions (kfuncs)
=============================

1. Introduction
===============

BPF Kernel Functions, more commonly known as kfuncs, are functions in the Linux
kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
kfuncs do not have a stable interface and can change from one kernel release to
another. Hence, BPF programs need to be updated in response to changes in the
kernel.

2. Defining a kfunc
===================

There are two ways to expose a kernel function to BPF programs: either make an
existing function in the kernel visible, or add a new wrapper for BPF. In both
cases, care must be taken that a BPF program can only call such a function in a
valid context. To enforce this, visibility of a kfunc can be per program type.

If you are not creating a BPF wrapper for an existing kernel function, skip
ahead to :ref:`BPF_kfunc_nodef`.

2.1 Creating a wrapper kfunc
----------------------------

When defining a wrapper kfunc, the wrapper function should have extern linkage.
This prevents the compiler from optimizing away dead code, as this wrapper kfunc
is not invoked anywhere in the kernel itself. It is not necessary to provide a
prototype in a header for the wrapper kfunc.

An example is given below::

	/* Disables missing prototype warnings */
	__diag_push();
	__diag_ignore_all("-Wmissing-prototypes",
			  "Global kfuncs as their definitions will be in BTF");

	struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
	{
		return find_get_task_by_vpid(nr);
	}

	__diag_pop();

A wrapper kfunc is often needed when we need to annotate parameters of the
kfunc. Otherwise one may directly make the kfunc visible to the BPF program by
registering it with the BPF subsystem. See :ref:`BPF_kfunc_nodef`.

2.2 Annotating kfunc parameters
-------------------------------

As with BPF helpers, the verifier sometimes needs additional context to make
the usage of kernel functions safer and more useful. Hence, we can annotate a
parameter by suffixing the name of the argument of the kfunc with a __tag,
where tag may be one of the supported annotations.

2.2.1 __sz Annotation
---------------------

This annotation is used to indicate a memory and size pair in the argument list.
An example is given below::

	void bpf_memzero(void *mem, int mem__sz)
	{
	...
	}

Here, the verifier will treat the first argument as a PTR_TO_MEM, and the second
argument as its size. By default, without the __sz annotation, the size of the
type of the pointer is used. Without the __sz annotation, a kfunc cannot accept
a void pointer.

2.2.2 __k Annotation
--------------------

This annotation is only understood for scalar arguments, where it indicates that
the verifier must check the scalar argument to be a known constant, which does
not indicate a size parameter, and the value of the constant is relevant to the
safety of the program.

An example is given below::

	void *bpf_obj_new(u32 local_type_id__k, ...)
	{
	...
	}

Here, bpf_obj_new uses the local_type_id argument to find out the size of that
type ID in the program's BTF and return a sized pointer to it. Each type ID will
have a distinct size, hence it is crucial to treat each such call as distinct
when values don't match during verifier state pruning checks.

Hence, whenever a constant scalar argument is accepted by a kfunc which is not a
size parameter, and the value of the constant matters for program safety, the
__k suffix should be used.

.. _BPF_kfunc_nodef:

2.3 Using an existing kernel function
-------------------------------------

When an existing function in the kernel is fit for consumption by BPF programs,
it can be directly registered with the BPF subsystem. However, care must still
be taken to review the context in which it will be invoked by the BPF program
and whether it is safe to do so.

2.4 Annotating kfuncs
---------------------

In addition to kfuncs' arguments, the verifier may need more information about
the type of kfunc(s) being registered with the BPF subsystem. To do so, we
define flags on a set of kfuncs as follows::

	BTF_SET8_START(bpf_task_set)
	BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
	BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
	BTF_SET8_END(bpf_task_set)

This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Of course, it is also allowed to specify no flags.

2.4.1 KF_ACQUIRE flag
---------------------

The KF_ACQUIRE flag is used to indicate that the kfunc returns a pointer to a
refcounted object. The verifier will then ensure that the pointer to the object
is eventually released using a release kfunc, or transferred to a map using a
referenced kptr (by invoking bpf_kptr_xchg). If not, the verifier fails the
loading of the BPF program until no lingering references remain in all possible
explored states of the program.

2.4.2 KF_RET_NULL flag
----------------------

The KF_RET_NULL flag is used to indicate that the pointer returned by the kfunc
may be NULL. Hence, it forces the user to do a NULL check on the pointer
returned from the kfunc before making use of it (dereferencing or passing to
another helper). This flag is often used together with the KF_ACQUIRE flag, but
the two are orthogonal to each other.
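
As an illustration only, the following userspace C sketch models the discipline
that KF_ACQUIRE and KF_RET_NULL impose on a caller: check the returned pointer
for NULL, then release every reference that was acquired. It is not kernel or
verifier code. struct model_obj, model_acquire(), model_release() and
model_use() are hypothetical names, and a plain integer stands in for the
object's refcount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted kernel object. */
struct model_obj {
	int refcount;
};

/*
 * Models a KF_ACQUIRE | KF_RET_NULL kfunc: it may fail and return NULL,
 * otherwise it takes a reference that the caller now owns.
 */
static struct model_obj *model_acquire(struct model_obj *o)
{
	if (!o)
		return NULL;
	o->refcount++;
	return o;
}

/* Models a KF_RELEASE kfunc: drops one reference owned by the caller. */
static void model_release(struct model_obj *o)
{
	assert(o && o->refcount > 0);
	o->refcount--;
}

/*
 * A caller that follows the rules: NULL check first, release before
 * returning. Returns true if a reference was acquired and released.
 */
static bool model_use(struct model_obj *o)
{
	struct model_obj *ref = model_acquire(o);

	if (!ref)
		return false;	/* nothing acquired, nothing to release */

	/* ... the object may be used here ... */

	model_release(ref);
	return true;
}
```

The verifier enforces the same shape statically: a program that skips the NULL
check, or that can return while still holding the reference, is rejected at
load time instead of misbehaving at run time.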

2.4.3 KF_RELEASE flag
---------------------

The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
passed in to it. There can be only one referenced pointer that can be passed in.
All copies of the pointer being released are invalidated as a result of invoking
a kfunc with this flag.

2.4.4 KF_KPTR_GET flag
----------------------

The KF_KPTR_GET flag is used to indicate that the kfunc takes the first argument
as a pointer to kptr, safely increments the refcount of the object it points to,
and returns a reference to the user. The rest of the arguments may be normal
arguments of a kfunc. The KF_KPTR_GET flag should be used in conjunction with
the KF_ACQUIRE and KF_RET_NULL flags.

2.4.5 KF_TRUSTED_ARGS flag
--------------------------

The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
indicates that all pointer arguments are valid, and that all pointers to
BTF objects have been passed in their unmodified form (that is, at a zero
offset, and without having been obtained from walking another pointer).

There are two types of pointers to kernel objects which are considered "valid":

1. Pointers which are passed as tracepoint or struct_ops callback arguments.
2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc.

Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.

The definition of "valid" pointers is subject to change at any time, and has
absolutely no ABI stability guarantees.

2.4.6 KF_SLEEPABLE flag
-----------------------

The KF_SLEEPABLE flag is used for kfuncs that may sleep. Such kfuncs can only
be called by sleepable BPF programs (BPF_F_SLEEPABLE).
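
The call-time rules described for KF_TRUSTED_ARGS and KF_SLEEPABLE can be
summarised as two predicates. The userspace C sketch below is a simplified
model, not the verifier's implementation: the MODEL_KF_* bit values, struct
model_ptr_arg, model_arg_ok() and model_call_ok() are all hypothetical, and
the model considers a single pointer argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag bits for this sketch only; the values are illustrative assumptions. */
enum {
	MODEL_KF_TRUSTED_ARGS	= 1 << 0,
	MODEL_KF_SLEEPABLE	= 1 << 1,
};

/* Hypothetical summary of what is known about one pointer argument. */
struct model_ptr_arg {
	bool is_btf_obj;	/* points to a BTF-described kernel object */
	bool trusted;		/* tracepoint/struct_ops arg, or acquired */
	int off;		/* offset the program applied to the pointer */
};

/* Would this argument be acceptable to a KF_TRUSTED_ARGS kfunc? */
static bool model_arg_ok(const struct model_ptr_arg *arg)
{
	assert(arg != NULL);

	/* Simplification: non-BTF pointers are accepted at any offset. */
	if (!arg->is_btf_obj)
		return true;

	/* BTF objects must be trusted and passed unmodified. */
	return arg->trusted && arg->off == 0;
}

/* Would the call be accepted, given the kfunc flags and program kind? */
static bool model_call_ok(unsigned int kf_flags, bool prog_sleepable,
			  const struct model_ptr_arg *arg)
{
	if ((kf_flags & MODEL_KF_SLEEPABLE) && !prog_sleepable)
		return false;
	if ((kf_flags & MODEL_KF_TRUSTED_ARGS) && !model_arg_ok(arg))
		return false;
	return true;
}
```

In this model a pointer obtained by walking another pointer is represented by
``trusted = false``, and a pointer that the program advanced past the start of
the object by a non-zero ``off``; either is enough to reject the call.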

2.4.7 KF_DESTRUCTIVE flag
--------------------------

The KF_DESTRUCTIVE flag is used to indicate functions whose invocation is
destructive to the system. For example, such a call can result in the system
rebooting or panicking. Because of this, additional restrictions apply to these
calls. At the moment they only require the CAP_SYS_BOOT capability, but more can
be added later.

2.4.8 KF_RCU flag
-----------------

The KF_RCU flag is used for kfuncs which take an RCU pointer as their argument.
When used together with KF_ACQUIRE, it indicates the kfunc should have a
single argument which must be a trusted argument or a MEM_RCU pointer.
The argument may have a reference count of 0 and the kfunc must take this
into consideration.

2.5 Registering the kfuncs
--------------------------

Once the kfunc is prepared for use, the final step to making it visible is
registering it with the BPF subsystem. Registration is done per BPF program
type. An example is shown below::

	BTF_SET8_START(bpf_task_set)
	BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
	BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
	BTF_SET8_END(bpf_task_set)

	static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
		.owner = THIS_MODULE,
		.set   = &bpf_task_set,
	};

	static int init_subsystem(void)
	{
		return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
	}
	late_initcall(init_subsystem);

3. Core kfuncs
==============

The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.

3.1 struct task_struct * kfuncs
-------------------------------

There are a number of kfuncs that allow ``struct task_struct *`` objects to be
used as kptrs:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_acquire bpf_task_release

These kfuncs are useful when you want to acquire or release a reference to a
``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
struct_ops callback arg. For example:

.. code-block:: c

	/**
	 * A trivial example tracepoint program that shows how to
	 * acquire and release a struct task_struct * pointer.
	 */
	SEC("tp_btf/task_newtask")
	int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
	{
		struct task_struct *acquired;

		acquired = bpf_task_acquire(task);

		/*
		 * In a typical program you'd do something like store
		 * the task in a map, and the map will automatically
		 * release it later. Here, we release it manually.
		 */
		bpf_task_release(acquired);
		return 0;
	}

----

A BPF program can also look up a task from a pid. This can be useful if the
caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
it can acquire a reference on with bpf_task_acquire().

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_from_pid

Here is an example of it being used:

.. code-block:: c

	SEC("tp_btf/task_newtask")
	int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
	{
		struct task_struct *lookup;

		lookup = bpf_task_from_pid(task->pid);
		if (!lookup)
			/* A task should always be found, as %task is a tracepoint arg. */
			return -ENOENT;

		if (lookup->pid != task->pid) {
			/* bpf_task_from_pid() looks up the task via its
			 * globally-unique pid from the init_pid_ns. Thus,
			 * the pid of the lookup task should always be the
			 * same as the input task.
			 */
			bpf_task_release(lookup);
			return -EINVAL;
		}

		/* bpf_task_from_pid() returns an acquired reference,
		 * so it must be dropped before returning from the
		 * tracepoint handler.
		 */
		bpf_task_release(lookup);
		return 0;
	}

3.2 struct cgroup * kfuncs
--------------------------

``struct cgroup *`` objects also have acquire and release functions:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_acquire bpf_cgroup_release

These kfuncs are used in exactly the same manner as bpf_task_acquire() and
bpf_task_release() respectively, so we won't provide examples for them.

----

You may also acquire a reference to a ``struct cgroup`` kptr that's already
stored in a map using bpf_cgroup_kptr_get():

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_kptr_get

Here's an example of how it can be used:

.. code-block:: c

	/* struct containing the struct cgroup kptr which is actually stored in the map. */
	struct __cgroups_kfunc_map_value {
		struct cgroup __kptr_ref * cgroup;
	};

	/* The map containing struct __cgroups_kfunc_map_value entries. */
	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__type(key, int);
		__type(value, struct __cgroups_kfunc_map_value);
		__uint(max_entries, 1);
	} __cgroups_kfunc_map SEC(".maps");

	/* ... */

	/**
	 * A simple example tracepoint program showing how a
	 * struct cgroup kptr that is stored in a map can
	 * be acquired using the bpf_cgroup_kptr_get() kfunc.
	 */
	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
	{
		struct cgroup *kptr;
		struct __cgroups_kfunc_map_value *v;
		s32 id = cgrp->self.id;

		/* Assume a cgroup kptr was previously stored in the map. */
		v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
		if (!v)
			return -ENOENT;

		/* Acquire a reference to the cgroup kptr that's already stored in the map. */
		kptr = bpf_cgroup_kptr_get(&v->cgroup);
		if (!kptr)
			/* If no cgroup was present in the map, it's because
			 * we're racing with another CPU that removed it with
			 * bpf_kptr_xchg() between the bpf_map_lookup_elem()
			 * above, and our call to bpf_cgroup_kptr_get().
			 * bpf_cgroup_kptr_get() internally safely handles this
			 * race, and will return NULL if the cgroup is no longer
			 * present in the map by the time we invoke the kfunc.
			 */
			return -EBUSY;

		/* Free the reference we just took above. Note that the
		 * original struct cgroup kptr is still in the map. It will
		 * be freed either at a later time if another context deletes
		 * it from the map, or automatically by the BPF subsystem if
		 * it's still present when the map is destroyed.
		 */
		bpf_cgroup_release(kptr);

		return 0;
	}

----

Another kfunc available for interacting with ``struct cgroup *`` objects is
bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup,
and return it as a cgroup kptr.

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_ancestor

Eventually, BPF should be updated to allow this to happen with a normal memory
load in the program itself. This is currently not possible without more work in
the verifier. bpf_cgroup_ancestor() can be used as follows:

.. code-block:: c

	/**
	 * Simple tracepoint example that illustrates how a cgroup's
	 * ancestor can be accessed using bpf_cgroup_ancestor().
	 */
	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
	{
		struct cgroup *parent;

		/* The parent cgroup resides at the level before the current cgroup's level. */
		parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
		if (!parent)
			return -ENOENT;

		bpf_printk("Parent id is %d", parent->self.id);

		/* Return the parent cgroup that was acquired above. */
		bpf_cgroup_release(parent);
		return 0;
	}
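
The level arithmetic in the example above can be tried out in userspace. The
sketch below models a cgroup as a node with a level and a parent pointer;
struct model_cgroup and model_ancestor() are hypothetical names. It assumes
that bpf_cgroup_ancestor() returns NULL for a level outside the range 0 to
cgrp->level, and it deliberately ignores reference counting, which the real
kfunc handles by returning an acquired reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct cgroup. */
struct model_cgroup {
	int level;			/* 0 for the root cgroup */
	struct model_cgroup *parent;	/* NULL for the root cgroup */
};

/*
 * Return the ancestor of cgrp that sits at the requested level, or NULL
 * if the level is negative or deeper than cgrp itself. Asking for
 * cgrp->level returns cgrp.
 */
static struct model_cgroup *model_ancestor(struct model_cgroup *cgrp, int level)
{
	assert(cgrp != NULL);

	if (level < 0 || level > cgrp->level)
		return NULL;

	/* Walk towards the root until the requested level is reached. */
	while (cgrp->level > level)
		cgrp = cgrp->parent;

	return cgrp;
}
```

Under these assumptions, evaluating the example program's expression
``cgrp->level - 1`` on the root cgroup yields level -1, so the lookup fails and
the program takes its -ENOENT path.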