Revision tags: v6.6.26, v6.6.25, v6.6.24 |
|
#
876941f5 |
| 28-Mar-2024 |
Andrii Nakryiko <andrii@kernel.org> |
bpf: support deferring bpf_link dealloc to after RCU grace period
commit 1a80dbcb2dbaf6e4c216e62e30fa7d3daa8001ce upstream.
BPF link for some program types is passed as a "context" which can be use
bpf: support deferring bpf_link dealloc to after RCU grace period
commit 1a80dbcb2dbaf6e4c216e62e30fa7d3daa8001ce upstream.
BPF link for some program types is passed as a "context" which can be used by those BPF programs to look up additional information. E.g., for multi-kprobes and multi-uprobes, link is used to fetch BPF cookie values.
Because of this runtime dependency, when bpf_link refcnt drops to zero there could still be active BPF programs running accessing link data.
This patch adds generic support to defer bpf_link dealloc callback to after RCU GP, if requested. This is done by exposing two different deallocation callbacks, one synchronous and one deferred. If deferred one is provided, bpf_link_free() will schedule dealloc_deferred() callback to happen after RCU GP.
BPF is using two flavors of RCU: "classic" non-sleepable one and RCU tasks trace one. The latter is used when sleepable BPF programs are used. bpf_link_free() accommodates that by checking underlying BPF program's sleepable flag, and goes either through normal RCU GP only for non-sleepable, or through RCU tasks trace GP *and* then normal RCU GP (taking into account rcu_trace_implies_rcu_gp() optimization), if BPF program is sleepable.
We use this for multi-kprobe and multi-uprobe links, which dereference link during program run. We also preventively switch raw_tp link to use deferred dealloc callback, as upcoming changes in bpf-next tree expose raw_tp link data (specifically, cookie value) to BPF program at runtime as well.
Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Reported-by: syzbot+981935d9485a560bfbcb@syzkaller.appspotmail.com Reported-by: syzbot+2cb5a6c573e98db598cc@syzkaller.appspotmail.com Reported-by: syzbot+62d8b26793e8a2bd0516@syzkaller.appspotmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240328052426.3042617-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5 |
|
#
f91cd728 |
| 04-Dec-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Defer the free of inner map when necessary
[ Upstream commit 876673364161da50eed6b472d746ef88242b2368 ]
When updating or deleting an inner map in map array or map htab, the map may still be ac
bpf: Defer the free of inner map when necessary
[ Upstream commit 876673364161da50eed6b472d746ef88242b2368 ]
When updating or deleting an inner map in map array or map htab, the map may still be accessed by non-sleepable program or sleepable program. However bpf_map_fd_put_ptr() decreases the ref-counter of the inner map directly through bpf_map_put(), if the ref-counter is the last one (which is true for most cases), the inner map will be freed by ops->map_free() in a kworker. But for now, most .map_free() callbacks don't use synchronize_rcu() or its variants to wait for the elapse of a RCU grace period, so after the invocation of ops->map_free completes, the bpf program which is accessing the inner map may incur use-after-free problem.
Fix the free of inner map by invoking bpf_map_free_deferred() after both one RCU grace period and one tasks trace RCU grace period if the inner map has been removed from the outer map before. The deferment is accomplished by using call_rcu() or call_rcu_tasks_trace() when releasing the last ref-counter of bpf map. The newly-added rcu_head field in bpf_map shares the same storage space with work field to reduce the size of bpf_map.
Fixes: bba1dc0b55ac ("bpf: Remove redundant synchronize_rcu.") Fixes: 638e4b825d52 ("bpf: Allows per-cpu maps and map-in-map in sleepable programs") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
1c40ec6b |
| 04-Dec-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Add map and need_defer parameters to .map_fd_put_ptr()
[ Upstream commit 20c20bd11a0702ce4dc9300c3da58acf551d9725 ]
map is the pointer of outer map, and need_defer needs some explanation. need
bpf: Add map and need_defer parameters to .map_fd_put_ptr()
[ Upstream commit 20c20bd11a0702ce4dc9300c3da58acf551d9725 ]
map is the pointer of outer map, and need_defer needs some explanation. need_defer tells the implementation to defer the reference release of the passed element and ensure that the element is still alive before the bpf program, which may manipulate it, exits.
The following three cases will invoke map_fd_put_ptr() and different need_defer values will be passed to these callers:
1) release the reference of the old element in the map during map update or map deletion. The release must be deferred, otherwise the bpf program may incur use-after-free problem, so need_defer needs to be true. 2) release the reference of the to-be-added element in the error path of map update. The to-be-added element is not visible to any bpf program, so it is OK to pass false for need_defer parameter. 3) release the references of all elements in the map during map release. Any bpf program which has access to the map must have been exited and released, so need_defer=false will be OK.
These two parameters will be used by the following patches to fix the potential use-after-free problem for map-in-map.
Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: 876673364161 ("bpf: Defer the free of inner map when necessary") Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
f64b2dc8 |
| 06-Dec-2023 |
Jiri Olsa <jolsa@kernel.org> |
bpf: Fix prog_array_map_poke_run map poke update
commit 4b7de801606e504e69689df71475d27e35336fb3 upstream.
Lee pointed out issue found by syscaller [0] hitting BUG in prog array map poke update in
bpf: Fix prog_array_map_poke_run map poke update
commit 4b7de801606e504e69689df71475d27e35336fb3 upstream.
Lee pointed out issue found by syscaller [0] hitting BUG in prog array map poke update in prog_array_map_poke_run function due to error value returned from bpf_arch_text_poke function.
There's race window where bpf_arch_text_poke can fail due to missing bpf program kallsym symbols, which is accounted for with check for -EINVAL in that BUG_ON call.
The problem is that in such case we won't update the tail call jump and cause imbalance for the next tail call update check which will fail with -EBUSY in bpf_arch_text_poke.
I'm hitting following race during the program load:
CPU 0 CPU 1
bpf_prog_load bpf_check do_misc_fixups prog_array_map_poke_track
map_update_elem bpf_fd_array_map_update_elem prog_array_map_poke_run
bpf_arch_text_poke returns -EINVAL
bpf_prog_kallsyms_add
After bpf_arch_text_poke (CPU 1) fails to update the tail call jump, the next poke update fails on expected jump instruction check in bpf_arch_text_poke with -EBUSY and triggers the BUG_ON in prog_array_map_poke_run.
Similar race exists on the program unload.
Fixing this by moving the update to bpf_arch_poke_desc_update function which makes sure we call __bpf_arch_text_poke that skips the bpf address check.
Each architecture has slightly different approach wrt looking up bpf address in bpf_arch_text_poke, so instead of splitting the function or adding new 'checkip' argument in previous version, it seems best to move the whole map_poke_run update as arch specific code.
[0] https://syzkaller.appspot.com/bug?extid=97a4fe20470e9bc30810
Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Reported-by: syzbot+97a4fe20470e9bc30810@syzkaller.appspotmail.com Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Cc: Lee Jones <lee@kernel.org> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20231206083041.1306660-2-jolsa@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v6.6.4, v6.6.3, v6.6.2 |
|
#
3c49b49d |
| 09-Nov-2023 |
Andrii Nakryiko <andrii@kernel.org> |
bpf: handle ldimm64 properly in check_cfg()
[ Upstream commit 3feb263bb516ee7e1da0acd22b15afbb9a7daa19 ]
ldimm64 instructions are 16-byte long, and so have to be handled appropriately in check_cfg(
bpf: handle ldimm64 properly in check_cfg()
[ Upstream commit 3feb263bb516ee7e1da0acd22b15afbb9a7daa19 ]
ldimm64 instructions are 16-byte long, and so have to be handled appropriately in check_cfg(), just like the rest of BPF verifier does.
This has implications in three places: - when determining next instruction for non-jump instructions; - when determining next instruction for callback address ldimm64 instructions (in visit_func_call_insn()); - when checking for unreachable instructions, where second half of ldimm64 is expected to be unreachable;
We take this also as an opportunity to report jump into the middle of ldimm64. And adjust few test_verifier tests accordingly.
Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reported-by: Hao Sun <sunhao.th@gmail.com> Fixes: 475fb78fbf48 ("bpf: verifier (add branch/goto checks)") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231110002638.4168352-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3 |
|
#
8f873cc3 |
| 12-Sep-2023 |
Leon Hwang <hffilwlqm@gmail.com> |
bpf, x64: Fix tailcall infinite loop
[ Upstream commit 2b5dcb31a19a2e0acd869b12c9db9b2d696ef544 ]
From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT"), the ta
bpf, x64: Fix tailcall infinite loop
[ Upstream commit 2b5dcb31a19a2e0acd869b12c9db9b2d696ef544 ]
From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT"), the tailcall on x64 works better than before.
From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
From commit 5b92a28aae4dd0f8 ("bpf: Support attaching tracing BPF program to other BPF programs"), BPF program is able to trace other BPF programs.
How about combining them all together?
1. FENTRY/FEXIT on a BPF subprogram. 2. A tailcall runs in the BPF subprogram. 3. The tailcall calls the subprogram's caller.
As a result, a tailcall infinite loop comes up. And the loop would halt the machine.
As we know, in tail call context, the tail_call_cnt propagates by stack and rax register between BPF subprograms. So do in trampolines.
Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20230912150442.2009-3-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
b724a641 |
| 17-Sep-2023 |
Leon Hwang <hffilwlqm@gmail.com> |
bpf: Fix tr dereferencing
Fix 'tr' dereferencing bug when CONFIG_BPF_JIT is turned off.
When CONFIG_BPF_JIT is turned off, 'bpf_trampoline_get()' returns NULL, which is same as the cases when CONFI
bpf: Fix tr dereferencing
Fix 'tr' dereferencing bug when CONFIG_BPF_JIT is turned off.
When CONFIG_BPF_JIT is turned off, 'bpf_trampoline_get()' returns NULL, which is same as the cases when CONFIG_BPF_JIT is turned on.
Closes: https://lore.kernel.org/r/202309131936.5Nc8eUD0-lkp@intel.com/ Fixes: f7b12b6fea00 ("bpf: verifier: refactor check_attach_btf_id()") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230917153846.88732-1-hffilwlqm@gmail.com
show more ...
|
Revision tags: v6.5.2, v6.1.51, v6.5.1, v6.1.50 |
|
#
6a86b5b5 |
| 29-Aug-2023 |
Daniel Borkmann <daniel@iogearbox.net> |
bpf: Annotate bpf_long_memcpy with data_race
syzbot reported a data race splat between two processes trying to update the same BPF map value via syscall on different CPUs:
BUG: KCSAN: data-race i
bpf: Annotate bpf_long_memcpy with data_race
syzbot reported a data race splat between two processes trying to update the same BPF map value via syscall on different CPUs:
BUG: KCSAN: data-race in bpf_percpu_array_update / bpf_percpu_array_update
write to 0xffffe8fffe7425d8 of 8 bytes by task 8257 on cpu 1: bpf_long_memcpy include/linux/bpf.h:428 [inline] bpf_obj_memcpy include/linux/bpf.h:441 [inline] copy_map_value_long include/linux/bpf.h:464 [inline] bpf_percpu_array_update+0x3bb/0x500 kernel/bpf/arraymap.c:380 bpf_map_update_value+0x190/0x370 kernel/bpf/syscall.c:175 generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1749 bpf_map_do_batch+0x2df/0x3d0 kernel/bpf/syscall.c:4648 __sys_bpf+0x28a/0x780 __do_sys_bpf kernel/bpf/syscall.c:5241 [inline] __se_sys_bpf kernel/bpf/syscall.c:5239 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5239 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
write to 0xffffe8fffe7425d8 of 8 bytes by task 8268 on cpu 0: bpf_long_memcpy include/linux/bpf.h:428 [inline] bpf_obj_memcpy include/linux/bpf.h:441 [inline] copy_map_value_long include/linux/bpf.h:464 [inline] bpf_percpu_array_update+0x3bb/0x500 kernel/bpf/arraymap.c:380 bpf_map_update_value+0x190/0x370 kernel/bpf/syscall.c:175 generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1749 bpf_map_do_batch+0x2df/0x3d0 kernel/bpf/syscall.c:4648 __sys_bpf+0x28a/0x780 __do_sys_bpf kernel/bpf/syscall.c:5241 [inline] __se_sys_bpf kernel/bpf/syscall.c:5239 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5239 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x0000000000000000 -> 0xfffffff000002788
The bpf_long_memcpy is used with 8-byte aligned pointers, power-of-8 size and forced to use long read/writes to try to atomically copy long counters. It is best-effort only and no barriers are here since it _will_ race with concurrent updates from BPF programs. The bpf_long_memcpy() is called from bpf(2) syscall. Marco suggested that the best way to make this known to KCSAN would be to use data_race() annotation.
Reported-by: syzbot+97522333291430dd277f@syzkaller.appspotmail.com Suggested-by: Marco Elver <elver@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/bpf/000000000000d87a7f06040c970c@google.com Link: https://lore.kernel.org/bpf/57628f7a15e20d502247c3b55fceb1cb2b31f266.1693342186.git.daniel@iogearbox.net
show more ...
|
Revision tags: v6.5, v6.1.49, v6.1.48 |
|
#
0816b8c6 |
| 21-Aug-2023 |
Dave Marchevsky <davemarchevsky@fb.com> |
bpf: Consider non-owning refs to refcounted nodes RCU protected
An earlier patch in the series ensures that the underlying memory of nodes with bpf_refcount - which can have multiple owners - is not
bpf: Consider non-owning refs to refcounted nodes RCU protected
An earlier patch in the series ensures that the underlying memory of nodes with bpf_refcount - which can have multiple owners - is not reused until RCU grace period has elapsed. This prevents use-after-free with non-owning references that may point to recently-freed memory. While RCU read lock is held, it's safe to dereference such a non-owning ref, as by definition RCU GP couldn't have elapsed and therefore underlying memory couldn't have been reused.
From the perspective of verifier "trustedness" non-owning refs to refcounted nodes are now trusted only in RCU CS and therefore should no longer pass is_trusted_reg, but rather is_rcu_reg. Let's mark them MEM_RCU in order to reflect this new state.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230821193311.3290257-6-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.1.46 |
|
#
bb48cf16 |
| 14-Aug-2023 |
David Vernet <void@manifault.com> |
bpf: Document struct bpf_struct_ops fields
Subsystems that want to implement a struct bpf_struct_ops structure to enable struct_ops maps must currently reverse engineer how the structure works. Give
bpf: Document struct bpf_struct_ops fields
Subsystems that want to implement a struct bpf_struct_ops structure to enable struct_ops maps must currently reverse engineer how the structure works. Given that this is meant to be a way for subsystem maintainers to extend their subsystems using BPF, let's document it to make it a bit easier on them.
Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230814185908.700553-3-void@manifault.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
show more ...
|
Revision tags: v6.1.45 |
|
#
01b85396 |
| 09-Aug-2023 |
Yue Haibing <yuehaibing@huawei.com> |
bpf: Remove unused declaration bpf_link_new_file()
Commit a3b80e107894 ("bpf: Allocate ID for bpf_link") removed the implementation but not the declaration.
Signed-off-by: Yue Haibing <yuehaibing@h
bpf: Remove unused declaration bpf_link_new_file()
Commit a3b80e107894 ("bpf: Allocate ID for bpf_link") removed the implementation but not the declaration.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20230809140556.45836-1-yuehaibing@huawei.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
show more ...
|
Revision tags: v6.1.44 |
|
#
a3c485a5 |
| 07-Aug-2023 |
Jiri Olsa <jolsa@kernel.org> |
bpf: Add support for bpf_get_func_ip helper for uprobe program
Adding support for bpf_get_func_ip helper for uprobe program to return probed address for both uprobe and return uprobe.
We discussed
bpf: Add support for bpf_get_func_ip helper for uprobe program
Adding support for bpf_get_func_ip helper for uprobe program to return probed address for both uprobe and return uprobe.
We discussed this in [1] and agreed that uprobe can have special use of bpf_get_func_ip helper that differs from kprobe.
The kprobe bpf_get_func_ip returns: - address of the function if probe is attach on function entry for both kprobe and return kprobe - 0 if the probe is not attach on function entry
The uprobe bpf_get_func_ip returns: - address of the probe for both uprobe and return uprobe
The reason for this semantic change is that kernel can't really tell if the probe user space address is function entry.
The uprobe program is actually kprobe type program attached as uprobe. One of the consequences of this design is that uprobes do not have its own set of helpers, but share them with kprobes.
As we need different functionality for bpf_get_func_ip helper for uprobe, I'm adding the bool value to the bpf_trace_run_ctx, so the helper can detect that it's executed in uprobe context and call specific code.
The is_uprobe bool is set as true in bpf_prog_run_array_sleepable, which is currently used only for executing bpf programs in uprobe.
Renaming bpf_prog_run_array_sleepable to bpf_prog_run_array_uprobe to address that it's only used for uprobes and that it sets the run_ctx.is_uprobe as suggested by Yafang Shao.
Suggested-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> [1] https://lore.kernel.org/bpf/CAEf4BzZ=xLVkG5eurEuvLU79wAMtwho7ReR+XJAgwhFF4M-7Cg@mail.gmail.com/ Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Viktor Malik <vmalik@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230807085956.2344866-2-jolsa@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
show more ...
|
Revision tags: v6.1.43 |
|
#
6a5a148a |
| 01-Aug-2023 |
Arnd Bergmann <arnd@arndb.de> |
bpf: fix bpf_probe_read_kernel prototype mismatch
bpf_probe_read_kernel() has a __weak definition in core.c and another definition with an incompatible prototype in kernel/trace/bpf_trace.c, when CO
bpf: fix bpf_probe_read_kernel prototype mismatch
bpf_probe_read_kernel() has a __weak definition in core.c and another definition with an incompatible prototype in kernel/trace/bpf_trace.c, when CONFIG_BPF_EVENTS is enabled.
Since the two are incompatible, there cannot be a shared declaration in a header file, but the lack of a prototype causes a W=1 warning:
kernel/bpf/core.c:1638:12: error: no previous prototype for 'bpf_probe_read_kernel' [-Werror=missing-prototypes]
On 32-bit architectures, the local prototype
u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
passes arguments in other registers as the one in bpf_trace.c
BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size, const void *, unsafe_ptr)
which uses 64-bit arguments in pairs of registers.
As both versions of the function are fairly simple and only really differ in one line, just move them into a header file as an inline function that does not add any overhead for the bpf_trace.c callers and actually avoids a function call for the other one.
Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/ac25cb0f-b804-1649-3afb-1dc6138c2716@iogearbox.net/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230801111449.185301-1-arnd@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.1.42, v6.1.41, v6.1.40, v6.1.39 |
|
#
c3c510ce |
| 18-Jul-2023 |
Dave Marchevsky <davemarchevsky@fb.com> |
bpf: Add 'owner' field to bpf_{list,rb}_node
As described by Kumar in [0], in shared ownership scenarios it is necessary to do runtime tracking of {rb,list} node ownership - and synchronize updates
bpf: Add 'owner' field to bpf_{list,rb}_node
As described by Kumar in [0], in shared ownership scenarios it is necessary to do runtime tracking of {rb,list} node ownership - and synchronize updates using this ownership information - in order to prevent races. This patch adds an 'owner' field to struct bpf_list_node and bpf_rb_node to implement such runtime tracking.
The owner field is a void * that describes the ownership state of a node. It can have the following values:
NULL - the node is not owned by any data structure BPF_PTR_POISON - the node is in the process of being added to a data structure ptr_to_root - the pointee is a data structure 'root' (bpf_rb_root / bpf_list_head) which owns this node
The field is initially NULL (set by bpf_obj_init_field default behavior) and transitions states in the following sequence:
Insertion: NULL -> BPF_PTR_POISON -> ptr_to_root Removal: ptr_to_root -> NULL
Before a node has been successfully inserted, it is not protected by any root's lock, and therefore two programs can attempt to add the same node to different roots simultaneously. For this reason the intermediate BPF_PTR_POISON state is necessary. For removal, the node is protected by some root's lock so this intermediate hop isn't necessary.
Note that bpf_list_pop_{front,back} helpers don't need to check owner before removing as the node-to-be-removed is not passed in as input and is instead taken directly from the list. Do the check anyways and WARN_ON_ONCE in this unexpected scenario.
Selftest changes in this patch are entirely mechanical: some BTF tests have hardcoded struct sizes for structs that contain bpf_{list,rb}_node fields, those were adjusted to account for the new sizes. Selftest additions to validate the owner field are added in a further patch in the series.
[0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230718083813.3416104-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
0a1f7bfe |
| 18-Jul-2023 |
Dave Marchevsky <davemarchevsky@fb.com> |
bpf: Introduce internal definitions for UAPI-opaque bpf_{rb,list}_node
Structs bpf_rb_node and bpf_list_node are opaquely defined in uapi/linux/bpf.h, as BPF program writers are not expected to touc
bpf: Introduce internal definitions for UAPI-opaque bpf_{rb,list}_node
Structs bpf_rb_node and bpf_list_node are opaquely defined in uapi/linux/bpf.h, as BPF program writers are not expected to touch their fields - nor does the verifier allow them to do so.
Currently these structs are simple wrappers around structs rb_node and list_head and linked_list / rbtree implementation just casts and passes to library functions for those data structures. Later patches in this series, though, will add an "owner" field to bpf_{rb,list}_node, such that they're not just wrapping an underlying node type. Moreover, the bpf linked_list and rbtree implementations will deal with these owner pointers directly in a few different places.
To avoid having to do
void *owner = (void*)bpf_list_node + sizeof(struct list_head)
with opaque UAPI node types, add bpf_{list,rb}_node_kern struct definitions to internal headers and modify linked_list and rbtree to use the internal types where appropriate.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230718083813.3416104-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
25954730 |
| 06-Jul-2023 |
Anton Protopopov <aspsk@isovalent.com> |
bpf: add percpu stats for bpf_map elements insertions/deletions
Add a generic percpu stats for bpf_map elements insertions/deletions in order to keep track of both, the current (approximate) number
bpf: add percpu stats for bpf_map elements insertions/deletions
Add a generic percpu stats for bpf_map elements insertions/deletions in order to keep track of both, the current (approximate) number of elements in a map and per-cpu statistics on update/delete operations.
To expose these stats a particular map implementation should initialize the counter and adjust it as needed using the 'bpf_map_*_elem_count' helpers provided by this commit.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230706133932.45883-2-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29 |
|
#
cb8edce2 |
| 15-May-2023 |
Andrii Nakryiko <andrii@kernel.org> |
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall forces users to specify pinning location as a string-based absol
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall forces users to specify pinning location as a string-based absolute or relative (to current working directory) path. This has various implications related to security (e.g., symlink-based attacks), forces BPF FS to be exposed in the file system, which can cause races with other applications.
One of the feedbacks we got from folks working with containers heavily was that inability to use purely FD-based location specification was an unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET commands. This patch closes this oversight, adding path_fd field to BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by *at() syscalls for dirfd + pathname combinations.
This now allows interesting possibilities like working with detached BPF FS mount (e.g., to perform multiple pinnings without running a risk of someone interfering with them), and generally making pinning/getting more secure and not prone to any races and/or security attacks.
This is demonstrated by a selftest added in subsequent patch that takes advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate creating detached BPF FS mount, pinning, and then getting BPF map out of it, all while never exposing this private instance of BPF FS to outside worlds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
show more ...
|
#
47e79cbe |
| 15-May-2023 |
Yafang Shao <laoar.shao@gmail.com> |
bpf: Remove bpf trampoline selector
After commit e21aa341785c ("bpf: Fix fexit trampoline."), the selector is only used to indicate how many times the bpf trampoline image are updated and been displ
bpf: Remove bpf trampoline selector
After commit e21aa341785c ("bpf: Fix fexit trampoline."), the selector is only used to indicate how many times the bpf trampoline image are updated and been displayed in the trampoline ksym name. After the trampoline is freed, the selector will start from 0 again. So the selector is a useless value to the user. We can remove it.
If the user want to check whether the bpf trampoline image has been updated or not, the user can compare the address. Each time the trampoline image is updated, the address will change consequently. Jiri also pointed out another issue that perf is still using the old name "bpf_trampoline_%lu", so this change can fix the issue in perf.
Fixes: e21aa341785c ("bpf: Fix fexit trampoline.") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Cc: Jiri Olsa <olsajiri@gmail.com> Link: https://lore.kernel.org/bpf/ZFvOOlrmHiY9AgXE@krava Link: https://lore.kernel.org/bpf/20230515130849.57502-3-laoar.shao@gmail.com
show more ...
|
Revision tags: v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25 |
|
#
26662d73 |
| 20-Apr-2023 |
Joanne Koong <joannelkoong@gmail.com> |
bpf: Add bpf_dynptr_size
bpf_dynptr_size returns the number of usable bytes in a dynptr.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Sign
bpf: Add bpf_dynptr_size
bpf_dynptr_size returns the number of usable bytes in a dynptr.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230420071414.570108-4-joannelkoong@gmail.com
show more ...
|
#
2b99ef22 |
| 21-Apr-2023 |
Florian Westphal <fw@strlen.de> |
bpf: add test_run support for netfilter program type
add glue code so a bpf program can be run using userspace-provided netfilter state and packet/skb.
Default is to use ipv4:output hook point, but
bpf: add test_run support for netfilter program type
add glue code so a bpf program can be run using userspace-provided netfilter state and packet/skb.
Default is to use ipv4:output hook point, but this can be overridden by userspace. Userspace provided netfilter state is restricted, only hook and protocol families can be overridden and only to ipv4/ipv6.
Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-7-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
3e81740a |
| 15-Apr-2023 |
Dave Marchevsky <davemarchevsky@fb.com> |
bpf: Centralize btf_field-specific initialization logic
All btf_fields in an object are 0-initialized by memset in bpf_obj_init. This might not be a valid initial state for some field types, in whic
bpf: Centralize btf_field-specific initialization logic
All btf_fields in an object are 0-initialized by memset in bpf_obj_init. This might not be a valid initial state for some field types, in which case kfuncs that use the type will properly initialize their input if it's been 0-initialized. Some BPF graph collection types and kfuncs do this: bpf_list_{head,node} and bpf_rb_node.
An earlier patch in this series added the bpf_refcount field, for which the 0 state indicates that the refcounted object should be free'd. bpf_obj_init treats this field specially, setting refcount to 1 instead of relying on scattered "refcount is 0? Must have just been initialized, let's set to 1" logic in kfuncs.
This patch extends this treatment to list and rbtree field types, allowing most scattered initialization logic in kfuncs to be removed.
Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case they'll be 0-initialized without passing through the newly-added logic, so scattered initialization logic must remain for these collection root types.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-9-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
1512217c |
| 15-Apr-2023 |
Dave Marchevsky <davemarchevsky@fb.com> |
bpf: Support refcounted local kptrs in existing semantics
A local kptr is considered 'refcounted' when it is of a type that has a bpf_refcount field. When such a kptr is created, its refcount should
bpf: Support refcounted local kptrs in existing semantics
A local kptr is considered 'refcounted' when it is of a type that has a bpf_refcount field. When such a kptr is created, its refcount should be initialized to 1; when destroyed, the object should be free'd only if a refcount decr results in 0 refcount.
Existing logic always frees the underlying memory when destroying a local kptr, and 0-initializes all btf_record fields. This patch adds checks for "is local kptr refcounted?" and new logic for that case in the appropriate places.
This patch focuses on changing existing semantics and thus conspicuously does _not_ provide a way for BPF programs in increment refcount. That follows later in the series.
__bpf_obj_drop_impl is modified to do the right thing when it sees a refcounted type. Container types for graph nodes (list, tree, stashed in map) are migrated to use __bpf_obj_drop_impl as a destructor for their nodes instead of each having custom destruction code in their _free paths. Now that "drop" isn't a synonym for "free" when the type is refcounted it makes sense to centralize this logic.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
d54730b5 |
| 15-Apr-2023 |
Dave Marchevsky <davemarchevsky@fb.com> |
bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types meant for use in BPF programs. Similarly to other opaque
bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types meant for use in BPF programs. Similarly to other opaque types like bpf_spin_lock and bpf_rbtree_node, the verifier needs to know where in user-defined struct types a bpf_refcount can be located, so necessary btf_record plumbing is added to enable this. bpf_refcount is sized to hold a refcount_t.
Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in btf_record as refcount_off in addition to being in the field array. Caching refcount_off makes sense for this field because further patches in the series will modify functions that take local kptrs (e.g. bpf_obj_drop) to change their behavior if the type they're operating on is refcounted. So enabling fast "is this type refcounted?" checks is desirable.
No such verifier behavior changes are introduced in this patch, just logic to recognize 'struct bpf_refcount' in btf_record.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
cd2a8079 |
| 15-Apr-2023 |
Dave Marchevsky <davemarchevsky@fb.com> |
bpf: Remove btf_field_offs, use btf_record's fields instead
The btf_field_offs struct contains (offset, size) for btf_record fields, sorted by offset. btf_field_offs is always used in conjunction wi
bpf: Remove btf_field_offs, use btf_record's fields instead
The btf_field_offs struct contains (offset, size) for btf_record fields, sorted by offset. btf_field_offs is always used in conjunction with btf_record, which has btf_field 'fields' array with (offset, type), the latter of which btf_field_offs' size is derived from via btf_field_type_size.
This patch adds a size field to struct btf_field and sorts btf_record's fields by offset, making it possible to get rid of btf_field_offs. Less data duplication and less code complexity results.
Since btf_field_offs' lifetime closely followed the btf_record used to populate it, most complexity wins are from removal of initialization code like:
if (btf_record_successfully_initialized) { foffs = btf_parse_field_offs(rec); if (IS_ERR_OR_NULL(foffs)) // free the btf_record and return err }
Other changes in this patch are pretty mechanical:
* foffs->field_off[i] -> rec->fields[i].offset * foffs->field_sz[i] -> rec->fields[i].size * Sort rec->fields in btf_parse_fields before returning * It's possible that this is necessary independently of other changes in this patch. btf_record_find in syscall.c expects btf_record's fields to be sorted by offset, yet there's no explicit sorting of them before this patch, record's fields are populated in the order they're read from BTF struct definition. BTF docs don't say anything about the sortedness of struct fields. * All functions taking struct btf_field_offs * input now instead take struct btf_record *. All callsites of these functions already have access to the correct btf_record.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.1.24 |
|
#
1cf3bfc6 |
| 12-Apr-2023 |
Ilya Leoshkevich <iii@linux.ibm.com> |
bpf: Support 64-bit pointers to kfuncs
test_ksyms_module fails to emit a kfunc call targeting a module on s390x, because the verifier stores the difference between kfunc address and __bpf_call_base
bpf: Support 64-bit pointers to kfuncs
test_ksyms_module fails to emit a kfunc call targeting a module on s390x, because the verifier stores the difference between kfunc address and __bpf_call_base in bpf_insn.imm, which is s32, and modules are roughly (1 << 42) bytes away from the kernel on s390x.
Fix by keeping BTF id in bpf_insn.imm for BPF_PSEUDO_KFUNC_CALLs, and storing the absolute address in bpf_kfunc_desc.
Introduce bpf_jit_supports_far_kfunc_call() in order to limit this new behavior to the s390x JIT. Otherwise other JITs need to be modified, which is not desired.
Introduce bpf_get_kfunc_addr() instead of exposing both find_kfunc_desc() and struct bpf_kfunc_desc.
In addition to sorting kfuncs by imm, also sort them by offset, in order to handle conflicting imms from different modules. Do this on all architectures in order to simplify code.
Factor out resolving specialized kfuncs (XPD and dynptr) from fixup_kfunc_call(). This was required in the first place, because fixup_kfunc_call() uses find_kfunc_desc(), which returns a const pointer, so it's not possible to modify kfunc addr without stripping const, which is not nice. It also removes repetition of code like:
if (bpf_jit_supports_far_kfunc_call()) desc->addr = func; else insn->imm = BPF_CALL_IMM(func);
and separates kfunc_desc_tab fixups from kfunc_call fixups.
Suggested-by: Jiri Olsa <olsajiri@gmail.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230412230632.885985-1-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|